| Title: | Accessing and Validating Marine Environmental Data from 'SHARK' and Related Databases |
|---|---|
| Description: | Provides functions to retrieve, process, analyze, and quality-control marine physical, chemical, and biological data. The main focus is on Swedish monitoring data available through the 'SHARK' database <https://shark.smhi.se/en/>, with additional API support for 'Nordic Microalgae' <https://nordicmicroalgae.org/>, 'Dyntaxa' <https://artfakta.se/>, World Register of Marine Species ('WoRMS') <https://www.marinespecies.org>, 'AlgaeBase' <https://www.algaebase.org>, OBIS 'xylookup' web service <https://iobis.github.io/xylookup/> and Intergovernmental Oceanographic Commission (IOC) - UNESCO databases on harmful algae <https://www.marinespecies.org/hab/> and toxins <https://toxins.hais.ioc-unesco.org/>. |
| Authors: | Markus Lindh [aut] (Swedish Meteorological and Hydrological Institute, ORCID: <https://orcid.org/0000-0002-7120-4145>), Anders Torstensson [aut, cre] (Swedish Meteorological and Hydrological Institute, ORCID: <https://orcid.org/0000-0002-8283-656X>), Mikael Hedblom [ctb] (Swedish Meteorological and Hydrological Institute, ORCID: <https://orcid.org/0009-0007-5124-9956>), Bengt Karlson [ctb] (Swedish Meteorological and Hydrological Institute, ORCID: <https://orcid.org/0000-0002-7524-3504>), Peter Thor [ctb] (Swedish University of Agricultural Sciences, ORCID: <https://orcid.org/0000-0002-2603-2284>), Marie Johansen [ctb] (Swedish Meteorological and Hydrological Institute), SHARK [cph], SBDI [fnd] (Swedish Research Council, 2019-00242) |
| Maintainer: | Anders Torstensson <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.2.0 |
| Built: | 2026-06-08 20:27:38 UTC |
| Source: | https://github.com/sharksmhi/shark4r |
This function enhances a dataset of AphiaIDs (and optionally scientific names) with their complete hierarchical taxonomy from the World Register of Marine Species (WoRMS). Missing AphiaIDs can be resolved from scientific names automatically.
add_worms_taxonomy(
  aphia_ids,
  scientific_names = NULL,
  add_rank_to_hierarchy = FALSE,
  verbose = TRUE,
  aphia_id = deprecated(),
  scientific_name = deprecated()
)
aphia_ids |
Numeric vector of AphiaIDs. |
scientific_names |
Optional character vector of scientific names (same length as |
add_rank_to_hierarchy |
Logical (default FALSE). If TRUE, includes rank labels in the concatenated hierarchy string. |
verbose |
Logical (default TRUE). If TRUE, prints progress updates. |
aphia_id |
|
scientific_name |
A tibble with taxonomy columns added, including:
aphia_id, scientific_name
worms_kingdom, worms_phylum, worms_class, worms_order,
worms_family, worms_genus, worms_species
worms_scientific_name, worms_hierarchy
# Using AphiaID only
try(add_worms_taxonomy(c(1080, 109604), verbose = FALSE))

# Using a combination of AphiaID and scientific name
try(add_worms_taxonomy(
  aphia_ids = c(NA, 109604),
  scientific_names = c("Calanus finmarchicus", "Oithona similis"),
  verbose = FALSE
))
This function assigns default phytoplankton groups (Diatoms, Dinoflagellates, Cyanobacteria, or Other)
to a list of scientific names or Aphia IDs by retrieving species information from the
World Register of Marine Species (WoRMS). The function checks both Aphia IDs and scientific names,
handles missing records, and assigns the appropriate plankton group based on taxonomic classification in WoRMS.
Additionally, custom plankton groups can be specified using the custom_groups parameter,
allowing users to define additional classifications based on specific taxonomic criteria.
assign_phytoplankton_group(
  scientific_names,
  aphia_ids = NULL,
  diatom_class = c("Bacillariophyceae", "Coscinodiscophyceae", "Mediophyceae", "Diatomophyceae"),
  dinoflagellate_class = "Dinophyceae",
  cyanobacteria_class = "Cyanophyceae",
  cyanobacteria_phylum = "Cyanobacteria",
  match_first_word = TRUE,
  marine_only = FALSE,
  return_class = FALSE,
  custom_groups = list(),
  verbose = TRUE
)
scientific_names |
A character vector of scientific names of marine species. |
aphia_ids |
A numeric vector of Aphia IDs corresponding to the scientific names. If provided, it improves the accuracy and speed of the matching process. The length of |
diatom_class |
A character string or vector representing the diatom class. Default is "Bacillariophyceae", "Coscinodiscophyceae", "Mediophyceae" and "Diatomophyceae". |
dinoflagellate_class |
A character string or vector representing the dinoflagellate class. Default is "Dinophyceae". |
cyanobacteria_class |
A character string or vector representing the cyanobacteria class. Default is "Cyanophyceae". |
cyanobacteria_phylum |
A character string or vector representing the cyanobacteria phylum. Default is "Cyanobacteria". |
match_first_word |
A logical value indicating whether to match the first word of the scientific name if the Aphia ID is missing. Default is TRUE. |
marine_only |
A logical value indicating whether to restrict the results to marine taxa only. Default is FALSE. |
return_class |
A logical value indicating whether to include class information in the result. Default is FALSE. |
custom_groups |
A named list of additional custom plankton groups (optional). The names of the list correspond to the custom group names (e.g., "Cryptophytes"), and the values should be character vectors specifying one or more of the following taxonomic levels: |
verbose |
A logical value indicating whether to print progress messages. Default is TRUE. |
The aphia_ids parameter is not necessary but, if provided, will improve the certainty of the
matching process. If aphia_ids are available, they will be used directly to retrieve more accurate
WoRMS records. If missing, the function will attempt to match the scientific names to Aphia IDs by
querying WoRMS using the scientific name(s), with an additional fallback mechanism to match based on the
first word of the scientific name.
To skip one of the default plankton groups, you can set the class or phylum of the respective group to an empty string ("").
For example, to skip the "Cyanobacteria" group, you can set cyanobacteria_class = "" or cyanobacteria_phylum = "". These
taxa will then be placed in the "Other" group.
Custom groups are processed in the order they appear in the custom_groups list. If a taxon matches
multiple custom groups, it will be assigned to the group that appears last in the list, as later matches
overwrite earlier ones. For example, if Teleaulax amphioxeia matches both Cryptophytes (class-based)
and a specific group Teleaulax (name-based), it will be assigned to Teleaulax if Teleaulax is listed after
Cryptophytes in the custom_groups list.
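The overwrite order described above can be illustrated with a small base R sketch. It is not the package's implementation: the taxonomy records are hard-coded instead of being retrieved from WoRMS, the helper assign_custom() is hypothetical, and the use of `genus` as a taxonomic level key is an assumption made for illustration.

```r
# Hypothetical taxonomy records, as if already retrieved from WoRMS
taxa <- data.frame(
  scientific_name = c("Teleaulax amphioxeia", "Mesodinium rubrum"),
  class = c("Cryptophyceae", "Litostomatea"),
  genus = c("Teleaulax", "Mesodinium"),
  stringsAsFactors = FALSE
)

# Custom groups in list order; "Teleaulax" is listed after "Cryptophytes".
# The `genus` key is an assumption, not a documented level.
custom_groups <- list(
  Cryptophytes = list(class = "Cryptophyceae"),
  Teleaulax    = list(genus = "Teleaulax")
)

# Hypothetical helper: walk the groups in order and let later matches win
assign_custom <- function(taxa, custom_groups) {
  group <- rep("Other", nrow(taxa))
  for (group_name in names(custom_groups)) {
    rule <- custom_groups[[group_name]]
    for (level in names(rule)) {
      hit <- taxa[[level]] %in% rule[[level]]
      group[hit] <- group_name  # overwrites any earlier assignment
    }
  }
  group
}

taxa$plankton_group <- assign_custom(taxa, custom_groups)
reversed <- assign_custom(taxa, rev(custom_groups))
```

With this ordering Teleaulax amphioxeia ends up in the Teleaulax group; reversing the list (as in `reversed`) assigns it to Cryptophytes instead.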
A tibble with two columns: scientific_name and plankton_group, where the plankton group is assigned based on taxonomic classification.
https://marinespecies.org/ for WoRMS website.
https://CRAN.R-project.org/package=worrms
# Assign plankton groups to a list of species names
try(result <- assign_phytoplankton_group(
  scientific_names = c("Tripos fusus", "Diatoma", "Nodularia spumigena", "Octactis speculum"),
  verbose = FALSE))
if (exists("result")) print(result)

# Improve classification by explicitly providing Aphia IDs for ambiguous taxa
# Actinocyclus and Navicula are names shared by both diatoms and animals,
# which can lead to incorrect group assignment without an Aphia ID
try(result <- assign_phytoplankton_group(
  scientific_names = c("Actinocyclus", "Navicula", "Nodularia spumigena", "Tripos fusus"),
  aphia_ids = c(148944, 149142, NA, NA),
  verbose = FALSE))
if (exists("result")) print(result)

# Assign plankton groups using additional custom grouping
custom_groups <- list(
  Cryptophytes = list(class = "Cryptophyceae"),
  Ciliates = list(phylum = "Ciliophora")
)

# Assign with custom groups
try(result_custom <- assign_phytoplankton_group(
  scientific_names = c("Teleaulax amphioxeia", "Mesodinium rubrum", "Dinophysis acuta"),
  aphia_ids = c(106306, 232069, 109604),
  custom_groups = custom_groups, # Adding custom groups
  verbose = FALSE
))
if (exists("result_custom")) print(result_custom)
Calculates zooplankton biomass by combining per-individual dry weight
("Dry weight (mean)", in ug) with abundance measurements in the same
observation. Two biomass parameters are produced:
Biomass concentration (mg/m3) from "Abundance" rows
(ind/m3).
Integrated biomass (mg/m2) from "Integrated abundance" rows
(ind/m2).
calc_zooplankton_biomass(
  data,
  abundance_parameter = "Abundance",
  integrated_abundance_parameter = "Integrated abundance",
  dry_weight_parameter = "Dry weight (mean)",
  biomass_concentration_parameter = "Biomass concentration",
  integrated_biomass_parameter = "Integrated biomass",
  append = TRUE,
  drop_na_values = TRUE,
  keep_reference = FALSE
)
data |
A data frame or tibble in SHARK zooplankton format. Must contain
the observation-key columns listed in the Observation key section, plus
|
abundance_parameter |
Character string giving the parameter name for
abundance in |
integrated_abundance_parameter |
Character string giving the parameter
name for integrated abundance in |
dry_weight_parameter |
Character string giving the parameter name for
per-individual dry weight in |
biomass_concentration_parameter |
Character string used for calculated
biomass concentration rows. Defaults to "Biomass concentration". |
integrated_biomass_parameter |
Character string used for calculated
integrated biomass rows. Defaults to "Integrated biomass". |
append |
Logical. If |
drop_na_values |
Logical. If |
keep_reference |
Logical. If |
The conversion is biomass = dry_weight_ug * abundance / 1000 so that the
result is expressed in mg/m3 or mg/m2.
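As a worked instance of this conversion, the following sketch applies the formula to one illustrative observation. The numeric values are made up for the example and the variable names are not part of the package.

```r
# Illustrative inputs (not real monitoring data)
dry_weight_ug        <- 8.5    # per-individual dry weight, ug
abundance            <- 120    # ind/m3
integrated_abundance <- 3600   # ind/m2

# ug * ind/m3 / 1000 gives mg/m3; ug * ind/m2 / 1000 gives mg/m2
biomass_concentration <- dry_weight_ug * abundance / 1000
integrated_biomass    <- dry_weight_ug * integrated_abundance / 1000
```

Dividing by 1000 only converts ug to mg; the spatial unit (per m3 or per m2) is inherited from the abundance row.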
A tibble. By default, the original data are returned with calculated
biomass rows appended. If append = FALSE, only the calculated rows are
returned.
Dry-weight and abundance rows are matched one-to-one using the following columns, which together identify a single zooplankton observation in SHARK:
platform_code
station_name
sample_date
sample_time
sample_min_depth_m
sample_max_depth_m
aphia_id
sex_code
dev_stage_code
size_class
One biomass row is produced per matching abundance row (no aggregation
across size classes or development stages). Abundance rows without a
matching dry-weight value yield NA biomass and are dropped by default.
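The matching rule can be sketched in base R with merge() on the key columns listed above. This is a simplified illustration under stated assumptions, not the package's code: the data are invented, the key values for sex_code and size_class are placeholders, and the column names dry_weight_ug and abundance are chosen for the example.

```r
key <- c("platform_code", "station_name", "sample_date", "sample_time",
         "sample_min_depth_m", "sample_max_depth_m", "aphia_id",
         "sex_code", "dev_stage_code", "size_class")

# Invented long-format rows: one taxon has a dry weight, the other does not
obs <- data.frame(
  platform_code = "77SE", station_name = "ANHOLT E",
  sample_date = as.Date("2023-06-01"), sample_time = "10:00",
  sample_min_depth_m = 0, sample_max_depth_m = 30,
  aphia_id = c(104464, 104464, 104878),
  sex_code = "F", dev_stage_code = "AD", size_class = "1",
  parameter = c("Dry weight (mean)", "Abundance", "Abundance"),
  value = c(8.5, 120, 40),
  stringsAsFactors = FALSE
)

dw  <- obs[obs$parameter == "Dry weight (mean)", c(key, "value")]
abn <- obs[obs$parameter == "Abundance", c(key, "value")]
names(dw)[names(dw) == "value"]   <- "dry_weight_ug"
names(abn)[names(abn) == "value"] <- "abundance"

# Keep every abundance row; unmatched rows get NA dry weight
matched <- merge(abn, dw, by = key, all.x = TRUE)
matched$biomass <- matched$dry_weight_ug * matched$abundance / 1000

# Mirror of the default behaviour: drop rows with NA biomass
result <- matched[!is.na(matched$biomass), ]
```

Here the abundance row for AphiaID 104878 has no dry-weight partner, so its biomass is NA and it is removed in the last step.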
If no "Dry weight (mean)" rows are present in data, the function calls
calc_zooplankton_dry_weight() internally to compute them from
"Length (mean)" before calculating biomass.
zoo <- dplyr::tibble(
  platform_code = "77SE",
  station_name = "ANHOLT E",
  sample_date = as.Date("2023-06-01"),
  sample_time = "10:00",
  sample_min_depth_m = 0,
  sample_max_depth_m = 30,
  aphia_id = 104251,
  sex_code = NA_character_,
  dev_stage_code = "AD",
  size_class = NA_character_,
  parameter = c("Length (mean)", "Abundance", "Integrated abundance"),
  value = c(800, 120, 3600),
  unit = c("um", "ind/m3", "ind/m2")
)

calc_zooplankton_biomass(zoo, append = FALSE)
Calculates zooplankton dry weight from rows where parameter equals
"Length (mean)" using bundled taxa-specific coefficients for
mesozooplankton from Kattegat and Skagerrak.
calc_zooplankton_dry_weight(
  data,
  length_parameter = "Length (mean)",
  dry_weight_parameter = "Dry weight (mean)",
  append = TRUE,
  drop_na_values = TRUE,
  keep_reference = FALSE
)
data |
A data frame or tibble in SHARK zooplankton format. Must contain
the columns |
length_parameter |
Character string giving the parameter name used for
mean length. Defaults to "Length (mean)". |
dry_weight_parameter |
Character string used for the calculated dry
weight rows. Defaults to "Dry weight (mean)". |
append |
Logical. If |
drop_na_values |
Logical. If |
keep_reference |
Logical. If |
The dry weight calculation follows:
DW = 10^((B * log10(length)) - A)
For nauplii (dev_stage_code == "NP"), taxon-specific nauplii coefficients are
used when available. Otherwise, the general coefficients for
"copepod nauplii *all copepod species" are used. All other development
stages use the non-nauplii coefficients. In practice, these coefficients are
used for adults as well as copepodite stages and other non-NP stages.
The calculation assumes that SHARK "Length (mean)" values are reported in
um. Calculated dry-weight rows are assigned the unit ug.
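A direct transcription of the formula into R may help when checking individual values by hand. The helper name dry_weight_ug() is hypothetical; the coefficients are the non-nauplii values listed for Calanus finmarchicus in this section (A = 6.88, B = 2.69), and the length of 800 um is an arbitrary example.

```r
# Hypothetical helper implementing DW = 10^((B * log10(length)) - A)
dry_weight_ug <- function(length_um, A, B) {
  10^((B * log10(length_um)) - A)
}

# Calanus finmarchicus, all stages except nauplii: A = 6.88, B = 2.69
dw <- dry_weight_ug(800, A = 6.88, B = 2.69)   # roughly 8.5 ug

# Algebraically equivalent power-law form: length^B / 10^A
dw_alt <- 800^2.69 / 10^6.88
```

The second form makes clear that B is the exponent of the length-weight power law and A sets the scale.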
The bundled coefficient workbook used by this function can be accessed with:
system.file("extdata", "Mesozooplankton_Kattegat_Skagerrak_taxa_and_biomass_calculations.xlsx", package = "SHARK4R").
Matching is performed using aphia_id, as preferred for SHARK data. Taxa with
no matching coefficient keep NA dry-weight values.
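The coefficient selection described above (taxon-specific nauplii coefficients first, then the general copepod nauplii entry, otherwise NA) could look roughly like the sketch below. It is an illustration only: the table holds just four rows copied from the coefficient table in this section, and pick_coef() and the `nauplii` flag column are hypothetical names, not package internals.

```r
# Small excerpt of the coefficient table (values from this section)
coef_tbl <- data.frame(
  aphia_id = c(104464, 104464, 104878, 1080),
  nauplii  = c(FALSE, TRUE, FALSE, TRUE),
  A = c(6.88, 5.38, 8.37, 5.48),
  B = c(2.69, 2.03, 3.00, 2.23)
)
general_nauplii_id <- 1080  # "copepod nauplii *all copepod species"

# Hypothetical lookup helper
pick_coef <- function(aphia_id, dev_stage_code) {
  is_np <- identical(dev_stage_code, "NP")
  row <- coef_tbl[coef_tbl$aphia_id == aphia_id & coef_tbl$nauplii == is_np, ]
  if (nrow(row) == 0 && is_np) {
    # No taxon-specific nauplii entry: fall back to the general one
    row <- coef_tbl[coef_tbl$aphia_id == general_nauplii_id & coef_tbl$nauplii, ]
  }
  if (nrow(row) == 0) return(c(A = NA_real_, B = NA_real_))
  c(A = row$A[1], B = row$B[1])
}

np_specific <- pick_coef(104464, "NP")  # Calanus finmarchicus nauplii
np_general  <- pick_coef(104878, "NP")  # Temora longicornis nauplii, fallback
adult       <- pick_coef(104878, "AD")  # non-nauplii coefficients
unmatched   <- pick_coef(999999, "AD")  # unknown taxon, stays NA
```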
A tibble. By default, the original data are returned with calculated
"Dry weight (mean)" rows appended. If append = FALSE, only the calculated rows
are returned.
The bundled workbook used for these coefficients can be downloaded directly: Download coefficient workbook (.xlsx)
| Reference taxon | AphiaID | Development stage | A | B | Reference |
|---|---|---|---|---|---|
| Acartia bifilosa | 345919 | all stages except nauplii | 7.71 | 2.96 | Hay 1991(1) |
| Acartia clausi | 149755 | all stages except nauplii | 7.71 | 2.96 | Hay 1991(1) |
| Acartia longiremis | 346037 | all stages except nauplii | 7.71 | 2.96 | Hay 1991(1) |
| Acartia | 104108 | all stages except nauplii | 7.71 | 2.96 | Hay 1991(1) |
| Calanus finmarchicus | 104464 | all stages except nauplii | 6.88 | 2.69 | Hay 1991(1) |
| Calanus finmarchicus | 104464 | nauplii | 5.38 | 2.03 | Hygum et al. 2000(2) |
| Centropages hamatus | 104496 | all stages except nauplii | 6.09 | 2.45 | Hay 1991(1) |
| Centropages | 104159 | all stages except nauplii | 6.10 | 2.45 | Hay 1991(1) |
| Centropages typicus | 104499 | all stages except nauplii | 6.10 | 2.45 | Hay 1991(1) |
| Clausocalanus | 104161 | all stages except nauplii | 8.90 | 3.35 | Hay et al. 1988(3) |
| Corycaeus | 128634 | all stages except nauplii | 6.07 | 2.63 | Satapoomin 1999(4) |
| Cyclopoida | 106415 | all stages except nauplii | 6.72 | 2.71 | Uye 1982(5) |
| Evadne nordmanni | 106273 | all stages except nauplii | 5.79 | 2.80 | Hernroth 1985(6) |
| Fritillaria | 103358 | all stages except nauplii | 4.51 | 2.66 | Paffenhöfer 1976(10) |
| Harpacticoid copepod | 1102 | all stages except nauplii | 7.24 | 2.89 | Uye 1982(5) |
| copepod nauplii *all copepod species | 1080 | nauplii | 5.48 | 2.23 | Hay 1991(1) |
| Metridia | 104190 | all stages except nauplii | 7.12 | 2.68 | Hirche and Mumm 1992(9) |
| Microcalanus | 104164 | all stages except nauplii | 7.86 | 2.91 | Hay 1991(1) |
| Microsetella | 115341 | all stages except nauplii | 7.66 | 2.88 | Satapoomin 1999(4) |
| Oikopleura dioica | 103407 | all stages except nauplii | 4.51 | 2.66 | Paffenhöfer 1976(10) |
| Oithona | 106485 | nauplii | 2.68 | 2.14 | Almeda et al. 2010(8) |
| Oithona similis | 106656 | all stages except nauplii | 6.72 | 2.71 | Uye 1982(5) |
| Oncaea | 128690 | all stages except nauplii | 6.28 | 2.63 | Satapoomin 1999(4) |
| Paracalanus parvus | 104685 | all stages except nauplii | 6.16 | 2.45 | Hay 1991(1) |
| Penilia avirostris | 106272 | all stages except nauplii | 4.95 | 2.38 | Atienza et al. 2006(7) |
| Podon leukarti | 106277 | all stages except nauplii | 7.52 | 3.02 | Uye 1982(5) |
| Podon polyphemoides | 159919 | all stages except nauplii | 6.60 | 2.75 | Uye 1982(5) |
| Pseudocalanus | 104165 | all stages except nauplii | 8.37 | 3.00 | Hay et al. 1988(3) |
| Temora longicornis | 104878 | all stages except nauplii | 8.37 | 3.00 | Hay et al. 1988(3) |
For a local copy within an installed package, use:
system.file("extdata", "Mesozooplankton_Kattegat_Skagerrak_taxa_and_biomass_calculations.xlsx", package = "SHARK4R")
To download the latest version directly: Download from GitHub
(1) Hay SJ, Kiørboe T, Matthews A (1991) Zooplankton biomass and production in the North Sea during the Autumn Circulation experiment, October 1987-March 1988. Continental Shelf Research 11(12):1453-1476. doi:10.1016/0278-4343(91)90021-W
(2) Hygum BH, Rey C, Hansen BW (2000) Growth and development rates of Calanus finmarchicus nauplii during a diatom spring bloom. Marine Biology 136:1075-1085. doi:10.1007/s002270000313
(3) Hay SJ, Evans GT, Gamble JC (1988) Birth, growth and death rates for enclosed populations of calanoid copepods. Journal of Plankton Research 10(3):431-454. doi:10.1093/plankt/10.3.431
(4) Satapoomin S (1999) Carbon content of some common tropical Andaman Sea copepods. Journal of Plankton Research 21(11):2117-2123. doi:10.1093/plankt/21.11.2117
(5) Uye SI (1982) Length-weight relationships of important zooplankton from the Inland Sea of Japan. Journal of the Oceanographical Society of Japan 38:149-158. doi:10.1007/BF02110286
(6) Hernroth L, ed. (1985) Recommendations on methods for marine biological studies in the Baltic Sea: mesozooplankton biomass assessment / individual volume technique. Publication / The Baltic Marine Biologists - BMB, 10. Lysekil: Institute of Marine Research.
(7) Atienza D, Saiz E, Calbet A (2006) Feeding ecology of the marine cladoceran Penilia avirostris: natural diet, prey selectivity and daily ration. Marine Ecology Progress Series 315:211-220. doi:10.3354/meps315211
(8) Almeda R, Calbet A, Alcaraz M, Yebra L, Saiz E (2010) Effects of temperature and food concentration on the survival, development and growth rates of naupliar stages of Oithona davisae (Copepoda, Cyclopoida). Marine Ecology Progress Series 410:97-109. doi:10.3354/meps08625
(9) Hirche HJ, Mumm N (1992) Distribution of dominant copepods in the Nansen Basin, Arctic Ocean, in summer. Deep-Sea Research Part A. Oceanographic Research Papers 39(Suppl. 2):S485-S505. doi:10.1016/S0198-0149(06)80017-8
(10) Paffenhöfer GA (1976) On the biology of appendicularia of the southeastern North Sea. In: 10th European Symposium on Marine Biology, Ostend, Belgium, 17-23 September 1975, Vol. 2, pp. 437-455.
# Minimal example with a few rows
zoo <- dplyr::tibble(
  scientific_name = c("Acartia clausi", "Calanus finmarchicus", "Unknown taxon"),
  parameter = c("Length (mean)", "Length (mean)", "Length (mean)"),
  value = c(200, 250, 160),
  aphia_id = c(104251, 104464, 999999),
  dev_stage_code = c("AD", "NP", "NP")
)

# Calculate dry weight rows only
calc_zooplankton_dry_weight(zoo, append = FALSE)

# Download zooplankton data from SHARK
zoo_shark <- get_shark_data(
  "sharkdata_zooplankton",
  dataTypes = "Zooplankton",
  fromYear = 2023,
  toYear = 2023,
  stationName = "ANHOLT E",
  verbose = FALSE
)

# Calculate dry weight from "Length (mean)" and return only the new rows
calc_zooplankton_dry_weight(
  zoo_shark,
  append = FALSE
)
This function checks whether the codes reported in a specified column of a
dataset (e.g., project codes, ship codes, etc.) are present in the
official SHARK codelist provided by SMHI. If a cell contains multiple codes
separated by commas, each code is checked individually. The function downloads
and caches the codelist if necessary, compares the reported values against
the valid codes, and returns a tibble showing which codes matched.
Informative messages are printed if unmatched codes are found.
check_codes(
  data,
  field = "sample_project_name_en",
  code_type = "PROJ",
  match_column = "Description/English translate",
  clean_cache_days = 30,
  verbose = TRUE
)
data |
A tibble (or data.frame) containing the codes to check. |
field |
Character; name of the column in |
code_type |
Character; the type of code to check (e.g., |
match_column |
Character; the column in the SHARK codelist to match
against. Must be one of |
clean_cache_days |
Numeric; if not |
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
A tibble with unique reported codes (after splitting comma-separated
entries) and a logical column match_type indicating if they exist in the
SHARK codelist.
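The comma-splitting step can be pictured with a few lines of base R. This is only a sketch under assumptions: the reported values and the list of valid codes are invented, no codelist is downloaded, and the column name reported_code is hypothetical (match_type is the name given above).

```r
# Invented cell values; the first cell holds two comma-separated codes
reported <- c("NAT, REG", "NAT", "XYZ")

# Invented stand-in for the relevant codelist column
valid <- c("NAT", "REG", "LOC")

# Split on commas, trim whitespace, and keep each code once
codes <- unique(trimws(unlist(strsplit(reported, ","))))

result <- data.frame(
  reported_code = codes,           # hypothetical column name
  match_type    = codes %in% valid,
  stringsAsFactors = FALSE
)

unmatched <- result$reported_code[!result$match_type]
if (length(unmatched) > 0) {
  message("Unmatched codes: ", paste(unmatched, collapse = ", "))
}
```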
get_shark_codes() to get the current code list.
clean_shark4r_cache() to manually clear cached files.
This function checks whether the required and recommended global and datatype-specific SHARK system fields are present in a data frame.
Required fields: Missing or empty required fields are reported as errors.
Recommended fields: Missing or empty recommended fields are reported as warnings,
but only if level = "warning" is specified.
check_datatype(data, level = "error")
data |
A |
level |
Character. The level of validation:
|
A tibble summarizing missing or empty fields, with columns:
level: "error" or "warning".
field: Name of the missing or empty field.
row: Row number where the value is missing (NA) or NA if the whole column is missing.
message: Description of the issue.
# Example with required fields missing
df <- data.frame(
  visit_year = 2024,
  station_name = NA
)
check_datatype(df, level = "error")

# Example checking recommended fields as warnings
check_datatype(df, level = "warning")
check_depth() inspects one or two depth columns in a dataset and reports
potential problems such as missing values, non-numeric entries, or values
that conflict with bathymetry and shoreline information. It can also
validate depths against bathymetry data retrieved from a terra::SpatRaster
object or, if bathymetry = NULL, via the lookup_xy() function, which calls
the OBIS XY lookup API to obtain bathymetry (using EMODnet Bathymetry) and shore distance.
check_depth(
  data,
  depth_cols = c("sample_min_depth_m", "sample_max_depth_m"),
  lat_col = "sample_latitude_dd",
  lon_col = "sample_longitude_dd",
  report = TRUE,
  depthmargin = 0,
  shoremargin = NA,
  bathymetry = NULL
)
data |
A data frame containing sample metadata, including longitude, latitude, and one or two depth columns. |
depth_cols |
Character vector naming the depth column(s). Can be one
column (e.g., |
lat_col |
Name of the column containing latitude values. Default:
"sample_latitude_dd". |
lon_col |
Name of the column containing longitude values. Default:
"sample_longitude_dd". |
report |
Logical. If |
depthmargin |
Numeric. Allowed deviation (in meters) above bathymetry
before a depth is flagged as an error. Default = 0. |
shoremargin |
Numeric. Minimum offshore distance (in meters) required
for negative depths to be considered valid. If |
bathymetry |
Optional terra::SpatRaster object with one layer giving
bathymetry values. If |
The following checks are performed:
Missing depth column → warning
Empty depth column (all values missing) → warning
Non-numeric depth values → warning
Depth exceeds bathymetry + margin (depthmargin) → warning
Negative depth at offshore locations (beyond shoremargin) → warning
Minimum depth greater than maximum depth (if two columns supplied) → error
Longitude/latitude outside raster bounds → warning
Missing bathymetry value at coordinate → warning
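Two of the checks listed above (minimum depth greater than maximum depth, and depth exceeding bathymetry plus depthmargin) can be sketched without any raster or API call. The bathymetry values below are invented stand-ins for what a terra::SpatRaster or lookup_xy() would supply, and the report column names level, row and message are assumptions for this illustration.

```r
samples <- data.frame(
  sample_min_depth_m = c(5, 15, 10),
  sample_max_depth_m = c(3, 20, 60),
  bathymetry_m       = c(40, 40, 50)  # invented sea-floor depth per position
)
depthmargin <- 0

# Rows where the minimum depth is below the maximum depth in the water column
bad_order <- which(samples$sample_min_depth_m > samples$sample_max_depth_m)

# Rows where the sample is deeper than the sea floor plus the allowed margin
too_deep <- which(samples$sample_max_depth_m > samples$bathymetry_m + depthmargin)

# rep() keeps the constructors valid even when no rows are flagged
report <- rbind(
  data.frame(
    level = rep("error", length(bad_order)),
    row = bad_order,
    message = rep("Minimum depth greater than maximum depth", length(bad_order)),
    stringsAsFactors = FALSE
  ),
  data.frame(
    level = rep("warning", length(too_deep)),
    row = too_deep,
    message = rep("Depth exceeds bathymetry + margin", length(too_deep)),
    stringsAsFactors = FALSE
  )
)
```

Raising depthmargin to 10 would clear the warning for the third row, since 60 no longer exceeds 50 + 10.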
The function has been modified from the obistools package (Provoost and Bosch, 2024).
A tibble with one row per detected problem, containing:
Severity of the issue ("warning" or "error").
Row index in the input data where the issue occurred.
Name of the column(s) involved.
Human-readable description of the problem.
If report = FALSE, returns the subset of input rows that failed any check.
Provoost P, Bosch S (2024). “obistools: Tools for data enhancement and quality control” Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. R package version 0.1.0, https://iobis.github.io/obistools/.
# Example dataset with one depth column
example_data <- data.frame(
  sample_latitude_dd = c(59.3, 58.1, 57.5),
  sample_longitude_dd = c(18.0, 17.5, 16.2),
  sample_depth_m = c(10, -5, NA)
)

# Validate depths using OBIS XY lookup (bathymetry = NULL)
try(check_depth(example_data, depth_cols = "sample_depth_m"))

# Example dataset with min/max depth columns
example_data2 <- data.frame(
  sample_latitude_dd = c(59.0, 58.5),
  sample_longitude_dd = c(18.0, 17.5),
  sample_min_depth_m = c(5, 15),
  sample_max_depth_m = c(3, 20)
)
try(check_depth(example_data2, depth_cols = c("sample_min_depth_m", "sample_max_depth_m")))

# Return only failing rows
try(check_depth(example_data, depth_cols = "sample_depth_m", report = FALSE))
This function checks a SHARK data frame against the required and recommended
fields defined for a specific datatype. It verifies that all required fields
are present and contain non-empty values. If level = "warning", it
also checks for recommended fields and empty values within them.
Note: A single "*" marks required fields in the standard SHARK template. A double "**" is often used to specify columns required for national monitoring only. For more information, see: https://www.smhi.se/data/hav-och-havsmiljo/datavardskap-oceanografi-och-marinbiologi/leverera-data
check_fields(
  data,
  datatype,
  level = "error",
  stars = 1,
  bacterioplankton_subtype = "abundance",
  field_definitions = .field_definitions
)
data |
A data frame containing SHARK data to be validated. |
datatype |
A string giving the SHARK datatype to validate against.
Must exist as a name in the provided |
level |
Character string, either |
stars |
Integer. Maximum number of "*" levels to include.
Default = 1 (only single "*").
For example, |
bacterioplankton_subtype |
Character. For "Bacterioplankton" only: either "abundance" (default) or "production". Ignored for other datatypes. |
field_definitions |
A named list of field definitions. Each element
should contain two character vectors: |
Field definitions for SHARK data can be loaded in two ways:
From the SHARK4R package bundle (default):
The package contains a built-in object, .field_definitions,
which stores required and recommended fields for each datatype.
From GitHub (latest official version): To use the most up-to-date field definitions, you can load them directly from the SHARK4R-statistics repository:
defs <- load_shark4r_fields()
check_fields(my_data, "Phytoplankton", field_definitions = defs)
Delivery-format (all-caps) data:
If the column names in data are all uppercase (e.g. SDATE), check_fields() assumes
the dataset follows the official SHARK delivery template. In this case:
Required fields are determined from the delivery template using
get_delivery_template() and find_required_fields().
Recommended fields are ignored because the delivery templates do not define them.
The function validates that all required columns exist and contain non-empty values.
This ensures that both internal SHARK4R datasets (with camelCase or snake_case columns)
and official delivery files (ALL_CAPS columns) are validated correctly using the appropriate rules.
Stars in the template
Leading asterisks in the delivery template indicate required levels:
* = standard required column
** = required for national monitoring
Other symbols = additional requirement level
The stars parameter in check_fields() controls how many levels of required
columns to include.
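To make the effect of stars concrete, the sketch below counts leading asterisks on a set of header labels and keeps the columns whose level does not exceed a chosen threshold. The labels and the helper required_for() are hypothetical; in the package this work is done by get_delivery_template() and find_required_fields(), whose internals are not shown here.

```r
# Hypothetical template header labels with leading asterisks
template_cols <- c("*SDATE", "*STATN", "**PROJ", "COMNT")

# Number of leading "*" characters per label (0 when there are none)
n_stars <- nchar(sub("^(\\**).*$", "\\1", template_cols))

# Hypothetical helper: required columns for a given maximum star level
required_for <- function(stars) {
  keep <- n_stars >= 1 & n_stars <= stars
  sub("^\\*+", "", template_cols[keep])
}

level_one <- required_for(1)  # single-star columns only
level_two <- required_for(2)  # also includes double-star columns
```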
A tibble with the following columns:
Either "error" or "warning".
The name of the field that triggered the check.
Row number(s) in data where the issue occurred, or NA
if the whole field is missing.
A descriptive message explaining the problem.
The tibble will be empty if no problems are found.
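The shape of such a result can be imitated in base R. The following is a minimal sketch of a required-field check, not the package's implementation: check_required() is a hypothetical helper, the message texts are invented, and the column names level, field, row and message are an assumption based on the descriptions above.

```r
# Hypothetical helper: report missing columns and empty values as errors
check_required <- function(data, required) {
  out <- data.frame(level = character(), field = character(),
                    row = integer(), message = character(),
                    stringsAsFactors = FALSE)
  for (f in required) {
    if (!f %in% names(data)) {
      # Whole column absent: one row with row = NA
      out <- rbind(out, data.frame(
        level = "error", field = f, row = NA_integer_,
        message = "Required field is missing",
        stringsAsFactors = FALSE))
    } else {
      # Column present: flag NA or blank values row by row
      empty <- which(is.na(data[[f]]) | trimws(as.character(data[[f]])) == "")
      if (length(empty) > 0) {
        out <- rbind(out, data.frame(
          level = "error", field = f, row = empty,
          message = "Required value is empty",
          stringsAsFactors = FALSE))
      }
    }
  }
  out
}

df  <- data.frame(id = c(1, 2), value = c("x", NA), stringsAsFactors = FALSE)
res <- check_required(df, c("id", "value", "comment"))
ok  <- check_required(df[1, ], c("id", "value"))  # no problems: zero rows
```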
load_shark4r_fields for fetching the latest field definitions from GitHub,
get_delivery_template for downloading delivery templates from SMHI's website.
    # Example 1: Using built-in field definitions for "Phytoplankton"
    df_phyto <- data.frame(
      visit_date = "2023-06-01",
      sample_id = "S1",
      scientific_name = "Skeletonema marinoi",
      value = 123
    )

    # Check fields
    check_fields(df_phyto, "Phytoplankton", level = "warning")

    # Example 2: Load latest definitions from GitHub and use them
    try(defs <- load_shark4r_fields(verbose = FALSE))

    # Check fields using loaded field definitions
    if (exists("defs")) try(check_fields(df_phyto, "Phytoplankton", field_definitions = defs))

    # Example 3: Custom datatype with required + recommended fields
    defs <- list(
      ExampleType = list(
        required = c("id", "value"),
        recommended = "comment"
      )
    )

    # Example data
    df_ok <- data.frame(id = 1, value = "x", comment = "ok")

    # Check fields using custom field definitions
    check_fields(df_ok, "ExampleType", level = "warning", field_definitions = defs)
This function checks for logical rule violations in benthos/epibenthos data
by applying a user-defined condition to values for a given parameter.
It is intended to replace the old family of check_*_*_logical() functions.
    check_logical_parameter(
      data,
      param_name,
      condition,
      return_df = FALSE,
      return_logical = FALSE
    )
data |
A data frame. Must contain columns |
param_name |
Character; the name of the parameter to check. |
condition |
A function that takes a numeric vector of values and returns a logical vector (TRUE for rows considered problematic). |
return_df |
Logical. If TRUE, return a plain data.frame of problematic rows. |
return_logical |
Logical. If TRUE, return a logical vector of length nrow(data). Overrides return_df. |
A DT datatable, a data.frame, a logical vector, or NULL if no problems found.
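The core idea, applying a user-supplied condition only to the rows of one parameter, can be sketched in base R. The helper flag_parameter() is hypothetical and only mirrors the documented behaviour of return_logical = TRUE; the column names parameter and value are taken from the example for this function.

```r
# Hypothetical helper: flag rows of one parameter that satisfy `condition`.
flag_parameter <- function(data, param_name, condition) {
  is_param <- data$parameter == param_name
  flagged <- rep(FALSE, nrow(data))
  # Evaluate the condition only on values belonging to the chosen parameter.
  flagged[is_param] <- condition(data$value[is_param])
  flagged
}

df <- data.frame(
  parameter = c("Biomass", "Biomass", "Abundance", "Biomass"),
  value = c(5, -2, 10, 0)
)

# TRUE marks rows where Biomass is negative.
flags <- flag_parameter(df, "Biomass", function(x) x < 0)
flags
```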
    # Example dataset
    df <- dplyr::tibble(
      station_name = c("A1", "A2", "A3", "A4"),
      sample_date = as.Date("2023-05-01") + 0:3,
      sample_id = 101:104,
      parameter = c("Biomass", "Biomass", "Abundance", "Biomass"),
      value = c(5, -2, 10, 0)
    )

    # 1. Check that Biomass is never negative
    check_logical_parameter(df, "Biomass", function(x) x < 0)

    # 2. Same check, but return problematic rows as a data frame
    check_logical_parameter(df, "Biomass", function(x) x < 0, return_df = TRUE)

    # 3. Return logical vector marking problematic rows
    check_logical_parameter(df, "Biomass", function(x) x < 0, return_logical = TRUE)

    # 4. Check that Abundance is not zero (no problems found -> returns NULL)
    abundance_check <- check_logical_parameter(df, "Abundance", function(x) x == 0)
    print(abundance_check)
This function attempts to determine whether stations in a dataset are reported using nominal positions (i.e., generic or repeated coordinates across events), rather than actual measured coordinates.
    check_nominal_station(data, verbose = TRUE)
data |
A data frame containing at least the columns:
|
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
The function compares the number of unique sampling dates with the number of unique station coordinates.
If the number of unique sampling dates is larger than the number of unique station coordinates, the function suspects nominal station positions and issues a warning.
A data frame with distinct station names and their corresponding
latitude/longitude positions, if nominal positions are suspected.
Otherwise, returns NULL.
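The comparison described above can be illustrated with a few lines of base R. The helper suspect_nominal() is a hypothetical name, and the sketch only reproduces the documented rule (more unique dates than unique coordinates), not the package's actual code.

```r
# Hypothetical helper: TRUE when there are more unique sampling dates
# than unique coordinate pairs, which suggests nominal positions.
suspect_nominal <- function(data) {
  n_dates <- length(unique(data$sample_date))
  coords <- unique(data[, c("sample_longitude_dd", "sample_latitude_dd")])
  n_dates > nrow(coords)
}

# Two stations visited on three dates, always at the same two positions.
df_nominal <- data.frame(
  sample_date = rep(as.Date("2023-05-01") + 0:2, each = 2),
  station_name = rep(c("ST1", "ST2"), 3),
  sample_longitude_dd = rep(c(15.0, 16.0), 3),
  sample_latitude_dd = rep(c(58.5, 58.6), 3)
)

suspect_nominal(df_nominal)
```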
    df <- data.frame(
      sample_date = rep(seq.Date(Sys.Date(), by = "day", length.out = 3), each = 2),
      station_name = rep(c("ST1", "ST2"), 3),
      sample_longitude_dd = rep(c(15.0, 16.0), 3),
      sample_latitude_dd = rep(c(58.5, 58.6), 3)
    )

    check_nominal_station(df)
Identifies records whose coordinates fall on land, optionally applying a buffer to allow points near the coast.
    check_onland(
      data,
      land = NULL,
      report = FALSE,
      buffer = 0,
      offline = FALSE,
      plot_leaflet = FALSE,
      only_bad = FALSE
    )
data |
A data frame containing at least |
land |
Optional |
report |
Logical; if |
buffer |
Numeric; distance in meters inland for which points are still considered valid. Only used in online mode. Default is 0. |
offline |
Logical; if |
plot_leaflet |
Logical; if |
only_bad |
Logical; if |
The function supports both offline and online modes:
Offline mode (offline = TRUE): uses a local simplified shoreline from a cached
geopackage (land.gpkg). If the file does not exist, it is downloaded automatically and cached across R sessions.
Online mode (offline = FALSE): uses the OBIS web service to determine distance to the shore.
The function assumes all coordinates are in WGS84 (EPSG:4326). Supplying coordinates in a different CRS will result in incorrect intersection tests.
Optionally, a leaflet map can be plotted. Points on land are displayed as red markers,
while points in water are green. If only_bad = TRUE, only the red points (on land) are plotted.
If report = TRUE, a tibble with columns:
field: always NA (placeholder for future extension)
level: "warning" for all flagged rows
row: row numbers in data flagged as located on land
message: description of the issue
If report = FALSE and plot_leaflet = FALSE, returns a subset of data with only the flagged rows.
If plot_leaflet = TRUE, returns a leaflet map showing points on land (red) and in water (green),
unless only_bad = TRUE, in which case only red points are plotted.
    # Example data frame with coordinates
    example_data <- data.frame(
      sample_latitude_dd = c(59.3, 58.1, 57.5),
      sample_longitude_dd = c(18.6, 17.5, 16.7)
    )

    # Report points on land with a 100 m buffer
    try(report <- check_onland(example_data, report = TRUE, buffer = 100))
    if (exists("report")) print(report)

    # Plot all points colored by land/water
    try(map <- check_onland(example_data, plot_leaflet = TRUE))

    # Plot only bad points on land
    try(map_bad <- check_onland(example_data, plot_leaflet = TRUE, only_bad = TRUE))

    # Allow points up to 2000 m inland; only rows still flagged as on land are returned
    try(ok <- check_onland(example_data, report = FALSE, buffer = 2000))
    if (exists("ok")) print(nrow(ok))
This function checks whether values for a specified parameter exceed a predefined
threshold. Thresholds are provided in a dataframe (default .threshold_values),
which should contain columns for parameter, datatype, and at least one numeric
threshold column (e.g., extreme_upper). Only rows in data matching both the
parameter and delivery_datatype (datatype) are considered. Optionally, data
can be grouped by a custom column (e.g., location_sea_basin) when thresholds vary by group.
    check_outliers(
      data,
      parameter,
      datatype,
      threshold_col = "extreme_upper",
      thresholds = .threshold_values,
      custom_group = NULL,
      direction = c("above", "below"),
      return_df = FALSE,
      verbose = TRUE
    )
data |
A tibble containing data in SHARK format. Must include columns:
|
parameter |
Character. Name of the parameter to check. Must exist in both
|
datatype |
Character. Data type to match against |
threshold_col |
Character. Name of the threshold column in |
thresholds |
A tibble/data frame of thresholds. Must include columns |
custom_group |
Character or NULL. Optional column name in |
direction |
Character. Either |
return_df |
Logical. If TRUE, returns a plain data.frame of flagged rows instead of a DT datatable. Default = FALSE. |
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
Only rows in data matching both parameter and delivery_datatype are checked.
If custom_group is specified, thresholds are applied per group.
If threshold_col does not exist in thresholds, the function stops with a warning.
Values above the threshold (or below it, when direction = "below") are flagged as outliers.
Intended for interactive use in Shiny apps where threshold_col can be selected dynamically.
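The matching and comparison steps listed above can be sketched in base R. The helper flag_outliers() is hypothetical, ignores custom_group, and is only meant to show the logic under the column names used in this documentation; it is not the package's implementation.

```r
# Hypothetical helper: flag values beyond a single threshold (no grouping).
flag_outliers <- function(data, thresholds, parameter, datatype,
                          threshold_col = "extreme_upper", direction = "above") {
  if (!threshold_col %in% names(thresholds)) {
    stop("Column '", threshold_col, "' not found in thresholds")
  }
  # Keep only rows matching both the parameter and the datatype.
  rows <- data[data$parameter == parameter &
                 data$delivery_datatype == datatype, , drop = FALSE]
  limit <- thresholds[[threshold_col]][thresholds$parameter == parameter &
                                         thresholds$datatype == datatype]
  stopifnot(length(limit) == 1)
  rows$threshold <- rep(limit, nrow(rows))
  is_outlier <- if (direction == "above") rows$value > limit else rows$value < limit
  rows[is_outlier, , drop = FALSE]
}

example_data <- data.frame(
  station_name = c("S1", "S2"),
  parameter = c("Param1", "Param1"),
  value = c(5, 12),
  delivery_datatype = c("TypeA", "TypeA")
)
example_thresholds <- data.frame(
  parameter = "Param1", datatype = "TypeA",
  extreme_upper = 10, mild_upper = 8
)

outliers <- flag_outliers(example_data, example_thresholds, "Param1", "TypeA")
outliers
```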
If outliers are found, returns a DT::datatable or a data.frame (if return_df = TRUE)
containing:
datatype, station_name, sample_date, sample_id, parameter, value, threshold,
and custom_group if specified. Otherwise, prints a message indicating that values
are within the threshold range (if verbose = TRUE) and returns invisible(NULL).
get_shark_statistics() for preparing updated threshold data.
    # Minimal example dataset
    example_data <- dplyr::tibble(
      station_name = c("S1", "S2"),
      sample_date = as.Date(c("2025-01-01", "2025-01-02")),
      sample_id = 1:2,
      shark_sample_id_md5 = letters[1:2],
      sample_min_depth_m = c(0, 5),
      sample_max_depth_m = c(1, 6),
      parameter = c("Param1", "Param1"),
      value = c(5, 12),
      delivery_datatype = c("TypeA", "TypeA")
    )

    example_thresholds <- dplyr::tibble(
      parameter = "Param1",
      datatype = "TypeA",
      extreme_upper = 10,
      mild_upper = 8
    )

    # Check for values above "extreme_upper"
    check_outliers(
      data = example_data,
      parameter = "Param1",
      datatype = "TypeA",
      threshold_col = "extreme_upper",
      thresholds = example_thresholds,
      return_df = TRUE
    )

    # Check for values above "mild_upper"
    check_outliers(
      data = example_data,
      parameter = "Param1",
      datatype = "TypeA",
      threshold_col = "mild_upper",
      thresholds = example_thresholds,
      return_df = TRUE
    )
Applies parameter-specific and row-wise logical rules to benthos/epibenthos data,
flagging measurements that violate defined conditions. This function replaces
multiple deprecated check_*_logical() functions with a general, flexible implementation.
    check_parameter_rules(
      data,
      param_conditions = get(".param_conditions", envir = asNamespace("SHARK4R")),
      rowwise_conditions = get(".rowwise_conditions", envir = asNamespace("SHARK4R")),
      return_df = FALSE,
      return_logical = FALSE,
      verbose = TRUE
    )
data |
A data frame containing at least the columns |
param_conditions |
A named list of parameter-specific rules. Each element should be a list with:
Defaults to |
rowwise_conditions |
A named list of row-wise rules applied across multiple parameters.
Each element should be a function taking the full data frame and returning a logical vector.
Defaults to |
return_df |
Logical. If TRUE, problematic rows are returned as plain |
return_logical |
Logical. If TRUE, problematic rows are returned as logical vectors.
Overrides |
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
This function evaluates each parameter in param_conditions and rowwise_conditions.
Only parameters present in the dataset are checked. Messages are printed
indicating whether values are within expected ranges or which rows violate rules.
A named list of results for each parameter:
If return_logical = TRUE.
If return_df = TRUE and violations exist.
If violations exist and return_df = FALSE.
If no violations exist for the parameter.
Invisible return.
    df <- data.frame(
      station_name = c("A1", "A2", "A3", "A4"),
      sample_date = as.Date("2023-05-01") + 0:3,
      sample_id = 101:104,
      parameter = c("Wet weight", "Wet weight", "Abundance", "BQIm"),
      value = c(0, 5, 0, 3)
    )

    # Check against default package rules
    check_parameter_rules(df)

    # Return problematic rows as data.frame
    check_parameter_rules(df, return_df = TRUE)

    # Return logical vectors for each parameter
    rule_check <- check_parameter_rules(df, return_logical = TRUE)
    print(rule_check)
This function downloads the products folder from
the SHARK4R GitHub repository and places it in a user-specified directory.
The folder contains Shiny applications and R Markdown documents used for
quality control (QC) of SHARK data.
    check_setup(path, run_app = FALSE, force = FALSE, verbose = TRUE)
path |
Character string giving the directory where the products folder should be created. Must be provided by the user. |
run_app |
Logical, if |
force |
Logical, if |
verbose |
Logical, if |
If the products folder already exists in path, the download will be skipped unless
force = TRUE is specified. Optionally, the function can launch the
QC Shiny app directly after setup.
An (invisible) list with the path to the local products folder:
    # Download support files into a temporary directory
    try(check_setup(path = tempdir()))

    # Force re-download if already present
    try(check_setup(path = tempdir(), force = TRUE))

    # Download and run the QC Shiny app
    if (interactive()) {
      try(check_setup(path = tempdir(), run_app = TRUE))
    }
Matches reported station names against the SMHI curated station list
("station.txt") and checks whether matched stations fall within
pre-defined distance limits. This helps ensure that station assignments
are spatially consistent.
    check_station_distance(
      data,
      station_file = NULL,
      plot_leaflet = FALSE,
      try_synonyms = TRUE,
      fallback_crs = 4326,
      only_bad = FALSE,
      verbose = TRUE
    )
data |
A data frame containing at least the columns:
|
station_file |
Optional path to a custom station file (tab-delimited).
If |
plot_leaflet |
Logical; if |
try_synonyms |
Logical; if |
fallback_crs |
Integer; CRS (EPSG code) to use when creating spatial
points if no CRS is available. Defaults to |
only_bad |
Logical; if |
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
Optionally, a leaflet map of stations can be plotted. SMHI stations that match the reported data are shown as blue circles, with their allowed radius visualized and displayed in the popup (e.g., "ST1 (Radius: 1000 m)"). Reported stations are shown as markers colored by whether they fall within the radius (green), outside the radius (red), or unmatched (gray).
If try_synonyms = TRUE, the function will attempt a second match
using the SYNONYM_NAMES column in the station database, splitting
multiple synonyms separated by <or>.
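A synonym lookup of this kind can be sketched in base R. Only the SYNONYM_NAMES column and the <or> separator come from the description above; the STATION_NAME column, the sample rows, the case-insensitive comparison, and the helper match_by_synonym() are assumptions made for illustration.

```r
# Hypothetical extract of a station list; only SYNONYM_NAMES and the
# "<or>" separator are taken from the documentation.
stations <- data.frame(
  STATION_NAME = c("BY5 BORNHOLMSDJ", "ANHOLT E"),
  SYNONYM_NAMES = c("BY5<or>BORNHOLM DEEP", "ANHOLT EAST"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the station whose synonym list contains `name`.
match_by_synonym <- function(name, stations) {
  synonym_list <- strsplit(stations$SYNONYM_NAMES, "<or>", fixed = TRUE)
  hit <- vapply(
    synonym_list,
    function(syn) toupper(name) %in% toupper(trimws(syn)),
    logical(1)
  )
  if (any(hit)) stations$STATION_NAME[which(hit)[1]] else NA_character_
}

match_by_synonym("Bornholm Deep", stations)
match_by_synonym("NEW STATION", stations)
```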
The function first checks if a station file path is provided via the
station_file argument. If not, it looks for the
NODC_CONFIG environment variable. This variable can point to a folder
where the NODC (Swedish National Oceanographic Data Center) configuration and station file
are stored, typically including:
<NODC_CONFIG>/config/station.txt
If NODC_CONFIG is set and the folder exists, the function will use
station.txt from that location. Otherwise, it falls back to the
bundled station.zip included in the SHARK4R package.
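The lookup order described above (explicit argument, then NODC_CONFIG, then the bundled file) can be sketched as follows. The helper resolve_station_file() and the placeholder return value for the bundled file are hypothetical; the sketch only illustrates the precedence, not how the package locates its bundled data.

```r
# Hypothetical helper illustrating the documented precedence.
resolve_station_file <- function(station_file = NULL, bundled = "station.zip") {
  # 1. An explicit path always wins.
  if (!is.null(station_file)) {
    return(station_file)
  }
  # 2. Otherwise use NODC_CONFIG if it is set and the folder exists.
  config_dir <- Sys.getenv("NODC_CONFIG", unset = "")
  if (nzchar(config_dir) && dir.exists(config_dir)) {
    return(file.path(config_dir, "config", "station.txt"))
  }
  # 3. Fall back to the bundled file (placeholder name in this sketch).
  bundled
}

resolve_station_file("my_stations.txt")
```

Setting the variable for a session, for example with Sys.setenv(NODC_CONFIG = "path/to/folder"), would switch the sketch to the second branch.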
If plot_leaflet = FALSE, returns a data frame with columns:
Reported station name.
TRUE if station matched in SMHI list, FALSE otherwise.
Distance in meters from reported station to matched SMHI station.
TRUE if distance <= allowed radius, FALSE if outside, NA if unmatched.
If plot_leaflet = TRUE, the function produces a leaflet map showing:
Blue circles for SMHI stations with radius in the popup.
Reported stations colored by status: green (within radius), red (outside radius), gray (unmatched).
If only_bad = TRUE, only the red stations (outside radius) are displayed.
    # Example data
    df <- data.frame(
      station_name = c("ANHOLT E", "BY5 BORNHOLMSDJ", "NEW STATION"),
      sample_longitude_dd = c(12.1, 15.97, 17.5),
      sample_latitude_dd = c(56.7, 55.25, 58.7)
    )

    # Check station distance
    try(check_station_distance(df, try_synonyms = TRUE, verbose = FALSE))

    # Plot bad points in leaflet map
    try(map <- check_station_distance(df, plot_leaflet = TRUE, only_bad = TRUE, verbose = FALSE))
This function checks whether entries in the value column of a dataset are valid
numeric or logical values. It is particularly useful for identifying common data
entry errors such as inequality symbols (<, >) or unintended text strings
(e.g., "NA", "below detection"). The function reports any invalid entries
in an interactive DT::datatable for easy inspection.
    check_value_logical(data, return_df = FALSE)
data |
A data frame. Must contain a column named |
return_df |
Logical. If TRUE, return a plain data.frame of problematic rows instead of a DT datatable. Default = FALSE. |
A DT::datatable or data frame listing unique invalid entries, or NULL (invisibly)
if all values are correctly formatted as numeric or logical.
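One way to detect such entries is to test whether each value can be read as a number or as TRUE/FALSE. The helper is_invalid_value() below is a hypothetical base R sketch of that idea; the package's actual rules may differ, for example in how missing values are treated.

```r
# Hypothetical helper: TRUE for entries that are neither numeric nor logical.
is_invalid_value <- function(x) {
  x <- as.character(x)
  # as.numeric() yields NA for anything that is not a number, e.g. "<0.2".
  is_num <- !is.na(suppressWarnings(as.numeric(x)))
  is_lgl <- x %in% c("TRUE", "FALSE")
  !(is_num | is_lgl)
}

values <- c("3.4", "<0.2", "TRUE", "NA", "5e-3")
invalid <- is_invalid_value(values)

# Unique invalid entries, similar in spirit to what the function reports.
unique(values[invalid])
```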
    # Example dataset with mixed valid and invalid values
    df <- data.frame(
      station_name = c("A", "B", "C", "D", "E"),
      value = c("3.4", "<0.2", "TRUE", "NA", "5e-3")
    )

    # Check for invalid (non-numeric / non-logical) entries
    check_value_logical(df, return_df = TRUE)

    # Example with all valid numeric and logical values
    df_valid <- data.frame(value = c(1.2, 0, TRUE, FALSE, 3.5))
    check_value_logical(df_valid)
This function inspects a dataset containing sample coordinates to identify potential issues where longitude or latitude values are zero (0), which typically indicate missing or erroneous station positions. The function can return a summary table, a filtered data frame, or a logical vector highlighting problematic rows. It is useful as a data quality control step before spatial analyses or database imports.
    check_zero_positions(
      data,
      coord = "longitude",
      return_df = FALSE,
      return_logical = FALSE,
      verbose = TRUE
    )
data |
A data frame. Must contain |
coord |
Character. Which coordinate(s) to check: "longitude", "latitude", or "both". Default = "longitude". |
return_df |
Logical. If TRUE, return a plain data.frame of problematic rows instead of a DT datatable. Default = FALSE. |
return_logical |
Logical. If TRUE, return a logical vector of length nrow(data) indicating which rows have zero in the selected coordinate(s). Overrides return_df. Default = FALSE. |
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
A DT datatable, a data.frame, a logical vector, or NULL (if no problems found and return_logical = FALSE).
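The logical-vector output can be illustrated with a short base R sketch. The helper zero_position() is hypothetical, and it assumes that coord = "both" flags a row when either coordinate is zero; the package may define this differently.

```r
# Hypothetical helper: flag rows with a zero in the selected coordinate(s).
zero_position <- function(data, coord = c("longitude", "latitude", "both")) {
  coord <- match.arg(coord)
  lon_zero <- data$sample_longitude_dd == 0
  lat_zero <- data$sample_latitude_dd == 0
  # Assumption: "both" means a zero in longitude OR latitude.
  switch(coord,
    longitude = lon_zero,
    latitude = lat_zero,
    both = lon_zero | lat_zero
  )
}

df_pos <- data.frame(
  station_name = c("A", "B", "C"),
  sample_longitude_dd = c(15.2, 0, 18.7),
  sample_latitude_dd = c(56.3, 58.1, 0)
)

zero_position(df_pos, coord = "both")
```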
    # Example data
    df <- data.frame(
      station_name = c("A", "B", "C"),
      sample_longitude_dd = c(15.2, 0, 18.7),
      sample_latitude_dd = c(56.3, 58.1, 0)
    )

    # Check for zeroes in both coordinates and return as data.frame
    check_zero_positions(df, coord = "both", return_df = TRUE)

    # Return a logical vector instead of a table
    check_zero_positions(df, coord = "both", return_logical = TRUE)
This function scans a dataset for cases where the measurement column (value)
contains zero (0) values, which may indicate missing, censored, or erroneous data.
It returns either a DT::datatable for easy inspection or a plain data.frame of
the affected rows. This function is useful for quality control and validation
prior to data aggregation, reporting, or database submission.
    check_zero_value(data, return_df = FALSE)
data |
A data frame. Must contain a column named |
return_df |
Logical. If TRUE, return a plain data.frame of problematic rows instead of a DT datatable. Default = FALSE. |
A DT datatable or a data.frame of zero-value records, or NULL (invisibly)
if no zero values are found.
    # Example dataset
    df <- data.frame(
      station_name = c("A", "B", "C", "D"),
      sample_date = as.Date(c("2023-06-01", "2023-06-02", "2023-06-03", "2023-06-04")),
      value = c(3.2, 0, 1.5, 0)
    )

    # Return a plain data.frame of zero-value records
    check_zero_value(df, return_df = TRUE)
Deletes cached files in the SHARK4R cache directory that are older than a specified number of days.
    clean_shark4r_cache(
      days = 1,
      cache_dir = NULL,
      clear_perm_cache = FALSE,
      search_pattern = NULL,
      verbose = TRUE
    )
days |
Numeric; remove files older than this number of days. Default is 1. |
cache_dir |
Character; path to the cache directory to clean.
Defaults to the package cache directory in the user-specific R folder
(via the internal |
clear_perm_cache |
Logical. If |
search_pattern |
Character; optional regex pattern to filter which files to consider for deletion. |
verbose |
Logical. If |
The cache is automatically cleared for files older than 24 hours.
Files in the perm subdirectory are not removed automatically and
must be cleared explicitly using clear_perm_cache = TRUE.
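Age-based cleaning of this kind can be sketched with base R file utilities. The helper remove_old_files() and the directory layout below are hypothetical; the sketch only shows how top-level files older than a cutoff might be removed while a perm subdirectory is left untouched.

```r
# Hypothetical helper: delete top-level files older than `days` days.
remove_old_files <- function(dir, days = 1, pattern = NULL) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  files <- files[!dir.exists(files)]   # skip subdirectories such as "perm"
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(files), units = "days"))
  old <- files[age_days > days]
  unlink(old)
  invisible(old)
}

# Build a throwaway cache with one old file, one new file, and a perm file.
cache <- file.path(tempdir(), "cache_sketch")
dir.create(file.path(cache, "perm"), recursive = TRUE, showWarnings = FALSE)
old_file <- file.path(cache, "old.rds")
new_file <- file.path(cache, "new.rds")
perm_file <- file.path(cache, "perm", "keep.rds")
file.create(old_file, new_file, perm_file)

# Backdate two of the files by three days.
Sys.setFileTime(old_file, Sys.time() - 3 * 24 * 3600)
Sys.setFileTime(perm_file, Sys.time() - 3 * 24 * 3600)

removed <- remove_old_files(cache, days = 1)
basename(removed)
```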
Invisible NULL. Messages are printed about what was deleted
and whether the in-memory session cache was cleared.
get_peg_list(), get_nomp_list(), get_shark_codes(), get_dyntaxa_dwca(), get_shark_statistics()
for functions that populate the cache.
    # Remove files older than 60 days and clear session cache
    try(clean_shark4r_cache(days = 60))
This function constructs a taxonomy table based on Dyntaxa taxon IDs. It queries the SLU Artdatabanken API (Dyntaxa) to fetch taxonomy information and organizes the data into a hierarchical table.
    construct_dyntaxa_table(
      taxon_ids,
      subscription_key = Sys.getenv("DYNTAXA_KEY"),
      shark_output = TRUE,
      add_parents = TRUE,
      add_descendants = FALSE,
      add_descendants_rank = "genus",
      add_synonyms = TRUE,
      add_missing_taxa = FALSE,
      add_hierarchy = FALSE,
      verbose = TRUE,
      add_genus_children = deprecated(),
      recommended_only = deprecated(),
      parent_ids = deprecated()
    )
A valid Dyntaxa API subscription key is required. You can request a free key for the "Taxonomy" service from the ArtDatabanken API portal: https://api-portal.artdatabanken.se/
Note: Please review the API conditions
and register for access before using the API. Data collected through the API
is stored at SLU Artdatabanken. Please also note that the authors of SHARK4R are not affiliated with SLU Artdatabanken.
A tibble representing the constructed taxonomy table.
get_worms_taxonomy_tree for an equivalent WoRMS function
SLU Artdatabanken API Documentation
    ## Not run:
    # Construct Dyntaxa taxonomy table for taxon IDs 238366 and 1010380
    taxon_ids <- c(238366, 1010380)
    taxonomy_table <- construct_dyntaxa_table(taxon_ids, "your_subscription_key")
    print(taxonomy_table)

    ## End(Not run)
This function converts geographic coordinates provided in the DDMM format (degrees and minutes) to decimal degrees. It can handle:
DDMM (e.g., 5733 → 57°33' → 57.55°)
DDMMss or DDMMss… (extra digits after minutes are interpreted as fractional minutes, e.g., 573345 → 57°33.45' → 57.5575°)
    convert_ddmm_to_dd(coord)
coord |
A numeric or character vector of coordinates in DDMM format. |
Non-numeric characters are removed before conversion. Coordinates
shorter than 4 digits are returned as NA.
A numeric vector of decimal degrees corresponding to the input coordinates. Names from the input vector are removed.
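The arithmetic behind the conversion can be sketched in base R. The helper ddmm_to_dd() is hypothetical and assumes two-digit degrees followed by two-digit minutes, with any further digits read as the fractional part of the minutes; it is an illustration of the documented rules rather than the package's code.

```r
# Hypothetical helper: DDMM(ss...) digits to decimal degrees.
ddmm_to_dd <- function(coord) {
  digits <- gsub("[^0-9]", "", as.character(coord))   # drop non-numeric characters
  out <- rep(NA_real_, length(digits))
  ok <- !is.na(digits) & nchar(digits) >= 4           # fewer than 4 digits -> NA
  deg <- as.numeric(substr(digits[ok], 1, 2))
  min_whole <- substr(digits[ok], 3, 4)
  min_frac <- substr(digits[ok], 5, nchar(digits[ok]))
  # Digits after position 4 become the fractional part of the minutes.
  minutes <- as.numeric(paste0(min_whole, ".", min_frac, "0"))
  out[ok] <- deg + minutes / 60
  out
}

ddmm_to_dd(c(5733, 6045))
ddmm_to_dd("573345")
ddmm_to_dd("12")
```

For 5733 this gives 57 + 33/60 = 57.55, and for 573345 it gives 57 + 33.45/60 = 57.5575, matching the cases listed in the description.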
    # Basic DDMM input
    convert_ddmm_to_dd(c(5733, 6045))

    # Input with fractional minutes
    convert_ddmm_to_dd(c("573345", "604523"))

    # Input with non-numeric characters
    convert_ddmm_to_dd(c("57°33'", "60°45'23\""))
Draws a pie chart at each station on a map. When pies would overlap in crowded regions, the pies are displaced asymmetrically away from their true station coordinates and a leader line + anchor dot is drawn so the viewer can still tell which pie belongs to which station. Works with any grouping (phytoplankton groups, zooplankton orders, microbial phyla, ...) and any numeric value (biomass, biovolume, abundance, ...).
    create_pie_map(
      data,
      station_col = "station_name",
      lon_col = "sample_longitude_dd",
      lat_col = "sample_latitude_dd",
      group_col = "group",
      value_col = "value",
      label_col = station_col,
      group_levels = NULL,
      group_colors = NULL,
      group_labels = NULL,
      radius = 0.28,
      size_by = NULL,
      size_range = c(0.15, 0.4),
      repel = TRUE,
      min_sep = 2.4,
      min_disp = 1.6,
      show_labels = TRUE,
      label_size = 3,
      pie_border_color = "white",
      pie_border_width = 0.3,
      leader_color = "gray20",
      leader_width = 0.5,
      anchor_color = "gray10",
      anchor_fill = "white",
      anchor_size = 1.8,
      basemap = NULL,
      basemap_source = c("ne", "eea", "obis"),
      basemap_scale = "medium",
      basemap_fill = "gray95",
      basemap_border = "gray70",
      sea_color = "aliceblue",
      xlim = NULL,
      ylim = NULL,
      pad = 1,
      title = NULL,
      legend_title = "Group",
      verbose = TRUE
    )
data |
A long-format data.frame with one row per
(station, group). Required columns are configurable through the
|
station_col, lon_col, lat_col, group_col, value_col
|
Column names in
|
label_col |
Column to use for the on-map station label. Defaults to
|
group_levels |
Optional character vector controlling the legend and
slice ordering. Groups not present in |
group_colors |
Optional named character vector of colors, keyed by
group name. If |
group_labels |
Optional named character vector of legend labels,
keyed by group name. Labels may include HTML markup (requires the
|
radius |
Pie radius in latitude degrees. Default 0.28. |
size_by |
Optional. |
size_range |
Numeric length-2: minimum and maximum radius (in
latitude degrees) when size_by is set. |
repel |
Logical. Run the displacement algorithm? Default TRUE. |
min_sep |
Minimum center-to-center separation between two pies,
expressed as a multiple of the larger of the two radii. Default 2.4. |
min_disp |
Minimum displacement for a pie that has been moved at
all, as a multiple of its radius. Default 1.6. |
show_labels |
Logical. Draw station labels next to each pie? |
label_size |
ggplot text size for the station labels. |
pie_border_color, pie_border_width
|
Aesthetics for the slice borders. |
leader_color, leader_width
|
Aesthetics for the leader segments drawn from anchor to displaced pie edge. |
anchor_color, anchor_fill, anchor_size
|
Aesthetics for the dot drawn at the true station location of each displaced pie. |
basemap |
Optional ggplot layer (or list of layers) used as the
base map. If |
basemap_source |
One of "ne" (default), "eea", or "obis". |
basemap_scale |
Resolution passed to rnaturalearth::ne_countries() when basemap_source = "ne". |
basemap_fill, basemap_border, sea_color
|
Colors for the default coastline basemap. |
xlim, ylim
|
Optional numeric length-2 vectors. If supplied they override the auto-fitted map extent. |
pad |
Padding (in degrees) added around station bounds when auto-fitting the extent. |
title, legend_title
|
Optional plot title and legend title. |
verbose |
Logical. If |
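Because data must be in long format with one row per (station, group), a wide table with one column per group has to be reshaped before plotting. The following is a minimal base-R sketch of that step. The table `wide`, its station names, and its values are invented for illustration; only the output column names follow the defaults shown in the usage above.

```r
# Hypothetical wide table: one row per station, one column per group
wide <- data.frame(
  station_name        = c("A", "B"),
  sample_longitude_dd = c(11.4, 12.1),
  sample_latitude_dd  = c(58.3, 56.7),
  Diatoms             = c(60, 40),
  Dinoflagellates     = c(25, 45),
  stringsAsFactors    = FALSE
)

# Columns that hold the per-group values
group_cols <- c("Diatoms", "Dinoflagellates")

# Stack one block of rows per group to get one row per (station, group)
long <- do.call(rbind, lapply(group_cols, function(g) {
  data.frame(
    station_name        = wide$station_name,
    sample_longitude_dd = wide$sample_longitude_dd,
    sample_latitude_dd  = wide$sample_latitude_dd,
    group               = g,
    value               = wide[[g]],
    stringsAsFactors    = FALSE
  )
}))

nrow(long)  # 2 stations x 2 groups = 4 rows
```

The resulting `long` table uses the default column names, so it could be passed to create_pie_map() without setting any of the `*_col` arguments.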
Coastline sources available via basemap_source when basemap = NULL:
"ne" - Natural Earth 1:10m / 1:50m / 1:110m land vectors,
fetched on the fly through rnaturalearth::ne_countries().
Resolution is controlled by basemap_scale. See
https://www.naturalearthdata.com. Requires the
rnaturalearth and rnaturalearthdata packages (both Suggests).
"eea" - high-resolution European coastline from the
European Environment Agency (EEA Coastline 2017). Best choice
for detailed regional maps of European waters. Downloaded
in chunks from the EEA ArcGIS service on first use and cached
locally. Dataset metadata:
https://sdi.eea.europa.eu/catalogue/datahub/api/records/9faa6ea1-372a-4826-a3c7-fb5b05e31c52/formatters/xsl-view?output=pdf&language=eng&approved=true.
"obis" - global land polygon distributed by the Ocean
Biodiversity Information System, downloaded from
https://obis-resources.s3.amazonaws.com/land.gpkg.
The "eea" and "obis" datasets are cached across sessions and can
be cleared with clean_shark4r_cache().
A ggplot object.
# Six SHARK monitoring stations spanning the Swedish west coast,
# Kattegat, and Baltic Proper. Note that SLÄGGÖ and Å17 sit close
# enough on the Skagerrak shelf that their pies will be repelled and
# drawn with leader lines.
stations <- dplyr::tibble(
  station_name = rep(c("SLÄGGÖ", "Å17", "ANHOLT E", "BY2 ARKONA",
                       "BY31 LANDSORTSDJ", "BY38 KARLSÖDJ"), each = 4),
  sample_latitude_dd = rep(c(58.25984, 58.28434, 56.66866, 54.97116,
                             58.59366, 57.11717), each = 4),
  sample_longitude_dd = rep(c(11.43567, 10.50432, 12.11117, 14.09883,
                              18.23633, 17.66867), each = 4),
  group = rep(c("Diatoms", "Dinoflagellates", "Cyanobacteria", "Other"), 6),
  value = c(
    60, 25, 10, 5,   # SLÄGGÖ
    55, 30, 5, 10,   # Å17
    40, 45, 5, 10,   # ANHOLT E
    15, 20, 55, 10,  # BY2 ARKONA
    20, 25, 45, 10,  # BY31 LANDSORTSDJ
    25, 20, 45, 10)  # BY38 KARLSÖDJ
)

# The default basemap uses `rnaturalearth` + `rnaturalearthdata` (both
# Suggests); each example below is guarded by a single-line check so the
# plots render inline rather than all at once after a large `if` block.
has_basemap <- requireNamespace("rnaturalearth", quietly = TRUE) &&
  requireNamespace("rnaturalearthdata", quietly = TRUE)

# 1. Uniform pie size, default palette, automatic map extent.
if (has_basemap) create_pie_map(stations)

# 2. Scale pie radius by the station's total value (`size_by = "total"`):
#    stations with larger summed biomass get bigger pies. `size_range`
#    controls the smallest/largest radius (in latitude degrees). Pie
#    sizes are relative within the plot only; no size legend is drawn.
if (has_basemap) create_pie_map(
  stations,
  size_by = "total",
  size_range = c(0.20, 0.55),
  group_colors = c(Diatoms = "#4A90D9",
                   Dinoflagellates = "#E74C3C",
                   Cyanobacteria = "#14B8A6",
                   Other = "#95A5A6"),
  legend_title = "Taxon group",
  title = "Phytoplankton composition (pie size = total biomass)"
)

# 3. Scale pie radius by an external per-station metric. Here we add a
#    fake chlorophyll-a column and pass its name to `size_by` - any
#    numeric per-station column in `data` works (e.g. secchi depth,
#    cell counts, nutrient concentrations).
stations$chla <- rep(c(3.1, 2.8, 4.2, 1.1, 1.5, 1.3), each = 4)
if (has_basemap) create_pie_map(
  stations,
  size_by = "chla",
  size_range = c(0.18, 0.50),
  legend_title = "Taxon group",
  title = "Pie size scaled by chlorophyll-a"
)

# 4. Use the high-resolution EEA coastline instead of the Natural Earth
#    default. The first call downloads the EEA polygon and caches it
#    for re-use; subsequent calls are fast. `basemap_scale` is ignored
#    for EEA. Suitable for regional European maps where Natural Earth's
#    coastline is too coarse.
try(create_pie_map(
  stations,
  basemap_source = "eea",
  legend_title = "Taxon group",
  title = "High-resolution EEA coastline",
  verbose = FALSE
))
Identifies which columns are mandatory in the SHARK delivery template based on rows starting with "*" (one or more). You can specify how many levels of asterisks to include.
find_required_fields(datatype, stars = 1, bacterioplankton_subtype = "abundance")
datatype |
Character. The datatype name. Available options include:
|
stars |
Integer. Maximum number of "*" levels to include.
Default = 1 (only single "*").
For example, |
bacterioplankton_subtype |
Character. For "Bacterioplankton" only: either "abundance" (default) or "production". Ignored for other datatypes. |
Note: A single "*" marks required fields in the standard SHARK template. A double "**" is often used to specify columns required for national monitoring only. For more information, see: https://www.smhi.se/data/hav-och-havsmiljo/datavardskap-oceanografi-och-marinbiologi/leverera-data
A character vector of column names that are required in the template.
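The effect of the stars argument can be sketched in base R. This is only an illustration of the selection rule described above, not the package's implementation: it assumes the asterisks appear as a prefix on each field name, and the field names (`FIELD_A` and so on) and the helper `required_by_stars()` are hypothetical.

```r
# Hypothetical template fields; the leading asterisks mark the requirement level
template_fields <- c("*FIELD_A", "**FIELD_B", "FIELD_C", "***FIELD_D", "*FIELD_E")

# Keep fields that carry between 1 and `stars` leading asterisks,
# and return their names with the asterisks removed
required_by_stars <- function(fields, stars = 1) {
  stripped <- sub("^\\*+", "", fields)
  n_stars  <- nchar(fields) - nchar(stripped)
  stripped[n_stars >= 1 & n_stars <= stars]
}

required_by_stars(template_fields)             # "FIELD_A" "FIELD_E"
required_by_stars(template_fields, stars = 2)  # "FIELD_A" "FIELD_B" "FIELD_E"
```

Fields without any asterisk are never returned, regardless of the value of `stars`.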
# Only single "*" required columns
try(find_required_fields("Bacterioplankton"))

# Include both "*" and "**" required columns (national monitoring too)
try(find_required_fields("Bacterioplankton", stars = 2))

# Include up to three levels of "*"
try(find_required_fields("Phytoplankton", stars = 3))
Downloads and reads the SHARK Excel delivery template for a given datatype. The template contains the column definitions and headers used for submission.
get_delivery_template(
  datatype,
  sheet = "Kolumner",
  header_row = 4,
  skip = 1,
  bacterioplankton_subtype = "abundance",
  force = FALSE,
  clean_cache_days = 1
)
datatype |
Character. The datatype name. Available options include:
|
sheet |
Character or numeric. Name (e.g., "Kolumner") or index (e.g., 1) of the sheet in the Excel file to read. Default is "Kolumner". |
header_row |
Integer. Row number in the Excel file that contains the column headers. Default is 4. |
skip |
Integer. Number of rows to skip before reading data. Default is 1. |
bacterioplankton_subtype |
Character. For "Bacterioplankton" only: either "abundance" (default) or "production". Ignored for other datatypes. |
force |
Logical; if |
clean_cache_days |
Numeric; if not |
A tibble containing the delivery template. Column names are set
from the header row.
# Bacterioplankton abundance
try(abun <- get_delivery_template("Bacterioplankton",
                                  bacterioplankton_subtype = "abundance"))
if (exists("abun")) print(abun)

# Bacterioplankton production
try(prod <- get_delivery_template("Bacterioplankton",
                                  bacterioplankton_subtype = "production"))

# Phytoplankton template
try(phyto <- get_delivery_template("Phytoplankton"))

# Phytoplankton column explanation (sheet number 3)
try(phyto_column_explanation <- get_delivery_template("Phytoplankton",
                                                      sheet = 3,
                                                      header_row = 4,
                                                      skip = 3))
if (exists("phyto_column_explanation")) print(phyto_column_explanation)
This function downloads a complete Darwin Core Archive (DwCA) of Dyntaxa from the SLU Artdatabanken API, extracts the archive, and reads the specified CSV file into R.
get_dyntaxa_dwca(
  subscription_key = Sys.getenv("DYNTAXA_KEY"),
  file_to_read = "Taxon.csv",
  force = FALSE,
  verbose = TRUE
)
subscription_key |
A Dyntaxa API subscription key. By default, the key
is read from the environment variable DYNTAXA_KEY. You can provide the key in three ways:
|
file_to_read |
A string specifying the name of the CSV file to read from the extracted archive.
Allowed options are: |
force |
A logical value indicating whether to force a fresh download of the archive,
even if a cached copy is available. Defaults to FALSE. |
verbose |
A logical value indicating whether to show download progress. Defaults to TRUE. |
By default, the archive is downloaded and cached across R sessions. On subsequent calls,
the function reuses the cached copy of the extracted files to avoid repeated downloads.
Use the force parameter to re-download the archive if needed. The cache is cleared
automatically after 24 hours, but you can also manually clear it using
clean_shark4r_cache.
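The 24-hour expiry described above amounts to comparing a cached file's modification time with the current time. The sketch below shows that check in base R. It is an illustration only: the helper `is_stale()` and the temporary file standing in for a cached archive are hypothetical, and the package's own cache handling may differ.

```r
# Return TRUE when a cached file is missing or older than `max_age_hours`
is_stale <- function(path, max_age_hours = 24) {
  if (!file.exists(path)) return(TRUE)
  age <- difftime(Sys.time(), file.info(path)$mtime, units = "hours")
  as.numeric(age) > max_age_hours
}

# A freshly written file stands in for a cached archive
cached_file <- tempfile(fileext = ".zip")
writeLines("placeholder", cached_file)

is_stale(cached_file)                  # FALSE: just written
is_stale(tempfile(fileext = ".zip"))   # TRUE: the file does not exist
```

A stale result corresponds to the situations in which a fresh download is needed, which is also what `force = TRUE` requests explicitly.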
A valid Dyntaxa API subscription key is required. You can request a free key for the "Taxonomy" service from the ArtDatabanken API portal: https://api-portal.artdatabanken.se/
Note: Please review the API conditions
and register for access before using the API. Data collected through the API
is stored at SLU Artdatabanken. Please also note that the authors of SHARK4R are not affiliated with SLU Artdatabanken.
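Since the default for subscription_key is Sys.getenv("DYNTAXA_KEY"), one common way to supply the key is to set that environment variable before calling the function. A minimal sketch follows; the key value is a placeholder, and the early check is a suggested pattern rather than something the package is known to do.

```r
# Set the key for the current R session (placeholder value, not a real key)
Sys.setenv(DYNTAXA_KEY = "your_subscription_key")

# This is the same expression used as the default for `subscription_key`
key <- Sys.getenv("DYNTAXA_KEY")

# Fail early in scripts when no key has been configured
if (!nzchar(key)) stop("DYNTAXA_KEY is not set")
```

To keep the key out of scripts, the line `DYNTAXA_KEY=your_subscription_key` can instead be placed in the user's `.Renviron` file, which R reads at startup.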
A tibble containing the data from the specified CSV file.
clean_shark4r_cache() to manually clear cached files.
## Not run:
# Provide your Dyntaxa API subscription key
subscription_key <- "your_subscription_key"

# Download and read the Taxon.csv file
taxon_data <- get_dyntaxa_dwca(subscription_key, file_to_read = "Taxon.csv")

## End(Not run)
This function queries the SLU Artdatabanken API (Dyntaxa) to retrieve parent taxon IDs for the specified taxon IDs. It constructs a request with the provided taxon IDs, sends the request to the SLU Artdatabanken API, and processes the response to return a list of parent taxon IDs.
get_dyntaxa_parent_ids(
  taxon_ids,
  subscription_key = Sys.getenv("DYNTAXA_KEY"),
  verbose = TRUE
)
taxon_ids |
A vector of numeric taxon IDs for which parent taxon IDs are requested. |
subscription_key |
A Dyntaxa API subscription key. By default, the key
is read from the environment variable DYNTAXA_KEY. You can provide the key in three ways:
|
verbose |
Logical. Default is TRUE. |
A valid Dyntaxa API subscription key is required. You can request a free key for the "Taxonomy" service from the ArtDatabanken API portal: https://api-portal.artdatabanken.se/
Note: Please review the API conditions
and register for access before using the API. Data collected through the API
is stored at SLU Artdatabanken. Please also note that the authors of SHARK4R are not affiliated with SLU Artdatabanken.
A list containing parent taxon IDs corresponding to the specified taxon IDs.
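A list of parent IDs can be used to find the ancestors that several taxa share. The sketch below assumes that each list element is a vector of parent IDs ordered from the root downwards; this ordering is an assumption about the return value, and the IDs are invented for illustration.

```r
# Invented parent ID vectors, one per queried taxon (root first; an assumption)
parent_ids <- list(
  c(1, 10, 100, 1000),
  c(1, 10, 200)
)

# IDs that occur in every element are shared ancestors
shared <- Reduce(intersect, parent_ids)

# With root-first ordering, the last shared ID is the closest common ancestor
lowest_shared <- shared[length(shared)]

shared         # 1 10
lowest_shared  # 10
```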
SLU Artdatabanken API Documentation
## Not run:
# Get parent taxon IDs for taxon IDs 238366 and 1010380
parent_ids <- get_dyntaxa_parent_ids(c(238366, 1010380), "your_subscription_key")
print(parent_ids)

## End(Not run)
This function queries the SLU Artdatabanken API (Dyntaxa) to retrieve taxonomic information for the specified taxon IDs. It constructs a request with the provided taxon IDs, sends the request to the SLU Artdatabanken API, and processes the response to return taxonomic information in a data frame.
get_dyntaxa_records(taxon_ids, subscription_key = Sys.getenv("DYNTAXA_KEY"))
taxon_ids |
A vector of numeric taxon IDs (Dyntaxa ID) for which taxonomic information is requested. |
subscription_key |
A Dyntaxa API subscription key. By default, the key
is read from the environment variable DYNTAXA_KEY. You can provide the key in three ways:
|
A valid Dyntaxa API subscription key is required. You can request a free key for the "Taxonomy" service from the ArtDatabanken API portal: https://api-portal.artdatabanken.se/
Note: Please review the API conditions
and register for access before using the API. Data collected through the API
is stored at SLU Artdatabanken. Please also note that the authors of SHARK4R are not affiliated with SLU Artdatabanken.
A tibble containing taxonomic information for the specified taxon IDs.
Columns include taxonId, names, category, rank, isRecommended, and parentTaxonId.
SLU Artdatabanken API Documentation
## Not run:
# Get taxonomic information for taxon IDs 238366 and 1010380
taxon_info <- get_dyntaxa_records(c(238366, 1010380), "your_subscription_key")
print(taxon_info)

## End(Not run)
This function retrieves the IOC-UNESCO Taxonomic Reference List of Harmful Microalgae (Lundholm et al. 2009) from the World Register of Marine Species (WoRMS). The data is returned as a dataframe, with options to customize the fields included in the download.
get_hab_list(
  species_only = TRUE,
  harmful_non_toxic_only = FALSE,
  aphia_id = TRUE,
  scientific_name = TRUE,
  authority = TRUE,
  fossil = TRUE,
  rank_name = TRUE,
  status_name = TRUE,
  qualitystatus_name = TRUE,
  modified = TRUE,
  lsid = TRUE,
  parent_id = TRUE,
  stored_path = TRUE,
  citation = TRUE,
  classification = TRUE,
  environment = TRUE,
  accepted_taxon = TRUE,
  verbose = TRUE
)
This function submits a POST request to the WoRMS database to retrieve the IOC-UNESCO Taxonomic Reference List of Harmful Microalgae.
The downloaded data can include various fields, which are controlled by the input parameters.
If a field is not required, set the corresponding parameter to FALSE to exclude it from the output.
A tibble containing the HABs taxonomic list, with columns based on the selected parameters.
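When only a few fields are wanted, writing out every `FALSE` by hand is tedious. The sketch below builds the logical field arguments programmatically from the names in the usage above. The final call is commented out because it needs the SHARK4R package and network access; the choice of fields in `keep` is arbitrary.

```r
# Field arguments as listed in the function signature
fields <- c("aphia_id", "scientific_name", "authority", "fossil", "rank_name",
            "status_name", "qualitystatus_name", "modified", "lsid",
            "parent_id", "stored_path", "citation", "classification",
            "environment", "accepted_taxon")

# Fields to keep in the download (arbitrary selection for illustration)
keep <- c("aphia_id", "scientific_name", "status_name")

# Named list of TRUE/FALSE flags, one per field argument
field_args <- as.list(setNames(fields %in% keep, fields))

# Requires SHARK4R and network access, so it is left commented out here:
# habs <- do.call(SHARK4R::get_hab_list, c(field_args, list(verbose = FALSE)))
```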
Lundholm, N.; Bernard, C.; Churro, C.; Escalera, L.; Hoppenrath, M.; Iwataki, M.; Larsen, J.; Mertens, K.; Murray, S.; Probert, I.; Salas, R.; Tillmann, U.; Zingone, A. (Eds) (2009 onwards). IOC-UNESCO Taxonomic Reference List of Harmful Microalgae. https://www.marinespecies.org/hab/. doi:10.14284/362
https://www.marinespecies.org/hab/ for IOC-UNESCO Taxonomic Reference List of Harmful Microalgae
# Download the default HABs taxonomic list
try(habs_taxlist_df <- get_hab_list())
if (exists("habs_taxlist_df")) head(habs_taxlist_df)

# Include higher taxa records
try(habs_taxlist_df <- get_hab_list(species_only = FALSE))
if (exists("habs_taxlist_df")) head(habs_taxlist_df)

# Retrieve only non-toxigenic harmful species (experimental stage)
try(habs_taxlist_df <- get_hab_list(harmful_non_toxic_only = TRUE, verbose = FALSE))
if (exists("habs_taxlist_df")) head(habs_taxlist_df)

# Include only specific fields in the output
try(habs_taxlist_df <- get_hab_list(aphia_id = TRUE,
                                    scientific_name = TRUE,
                                    authority = FALSE))
if (exists("habs_taxlist_df")) head(habs_taxlist_df)
This function downloads the latest available Nordic Marine Phytoplankton Group (NOMP) biovolume zip archive from SMHI, unzips it, and reads the first Excel file by default. You can also specify which file in the archive to read.
get_nomp_list(
  year = as.numeric(format(Sys.Date(), "%Y")),
  file = NULL,
  sheet = NULL,
  force = FALSE,
  base_url = NULL,
  clean_cache_days = 30,
  verbose = TRUE
)
year |
Numeric year to download. Default is current year; if not available, previous years are automatically tried. |
file |
Character string specifying which file in the zip archive to read. Defaults to the first Excel file in the archive. |
sheet |
Character or numeric; the name or index of the sheet to read from the Excel file. If NULL (the default), the first sheet is read. |
force |
Logical; if |
base_url |
Base URL (without "/nomp_taxa_biovolumes_and_carbon_YYYY.zip") for the NOMP biovolume files. Defaults to the SMHI directory. |
clean_cache_days |
Numeric; if not |
verbose |
A logical indicating whether to print progress messages. Default is TRUE. |
A tibble with the contents of the requested Excel file.
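The year fallback described for the year argument can be pictured as a list of candidate URLs tried in order, built from base_url and the file name pattern given above. The sketch below only constructs those URLs. The helper `candidate_urls()`, the number of fallback years, and the example base URL are hypothetical, and the package may search differently.

```r
# Build download URLs for `year` and up to `n_back` earlier years
candidate_urls <- function(year, base_url, n_back = 2) {
  years <- seq(year, by = -1, length.out = n_back + 1)
  paste0(base_url, "/nomp_taxa_biovolumes_and_carbon_", years, ".zip")
}

# Placeholder base URL, not the real SMHI directory
urls <- candidate_urls(2024, "https://example.org/nomp")
urls
```

A downloader would try each element of `urls` in turn and stop at the first archive that exists.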
clean_shark4r_cache() to manually clear cached files.
# Get the latest available list
try(nomp_list <- get_nomp_list())
if (exists("nomp_list")) head(nomp_list)

# Get the 2023 list and clean old cache files older than 60 days
try(nomp_list_2023 <- get_nomp_list(2023, clean_cache_days = 60))
if (exists("nomp_list_2023")) head(nomp_list_2023)
This function retrieves external links related to algae taxa from the Nordic Microalgae API. It takes a vector of slugs (taxon identifiers) and returns a data frame containing the external links associated with each taxon. The data includes the provider, label, external ID, and the URL of the external link.
get_nua_external_links(slug, verbose = TRUE, unparsed = FALSE)
slug |
A vector of taxon slugs (identifiers) for which to retrieve external links. |
verbose |
A logical flag indicating whether to display a progress bar. Default is TRUE. |
unparsed |
Logical. If |
The slugs (taxon identifiers) used in this function can be retrieved using the get_nua_taxa() function,
which returns a data frame with a column for taxon slugs, along with other relevant metadata for each taxon.
When unparsed = FALSE: a tibble containing the following columns:
slug |
The slug (identifier) of the taxon. |
provider |
The provider of the external link. |
label |
The label of the external link. |
external_id |
The external ID associated with the external link. |
external_url |
The URL of the external link. |
collection |
The collection category, which is "External Links" for all rows. |
https://nordicmicroalgae.org/ for Nordic Microalgae website.
https://nordicmicroalgae.org/api/ for Nordic Microalgae API documentation.
# Retrieve external links for a vector of slugs
slugs <- c("chaetoceros-debilis", "alexandrium-tamarense")
try(external_links <- get_nua_external_links(slug = slugs, verbose = FALSE))
if (exists("external_links")) head(external_links)
This function retrieves harmfulness information related to algae taxa from the Nordic Microalgae API. It takes a vector of slugs (taxon identifiers) and returns a data frame containing the harmfulness information associated with each taxon. The data includes the provider, label, external ID, and the URL of the external link.
get_nua_harmfulness(slug, verbose = TRUE)
slug |
A vector of taxon slugs (identifiers) for which to retrieve harmfulness information. |
verbose |
A logical flag indicating whether to display a progress bar. Default is TRUE. |
The slugs (taxon identifiers) used in this function can be retrieved using the get_nua_taxa() function,
which returns a data frame with a column for taxon slugs, along with other relevant metadata for each taxon.
A tibble containing the following columns:
slug |
The slug (identifier) of the taxon. |
provider |
The provider of the external link. |
label |
The label of the external link. |
external_id |
The external ID associated with the external link. |
external_url |
The URL of the external link. |
collection |
The collection category, which is "Harmful algae blooms" for all rows. |
https://nordicmicroalgae.org/ for Nordic Microalgae website.
https://nordicmicroalgae.org/api/ for Nordic Microalgae API documentation.
# Retrieve harmfulness information for a vector of slugs
try(harmfulness <- get_nua_harmfulness(slug = c("dinophysis-acuta",
                                                "alexandrium-ostenfeldii"),
                                       verbose = FALSE))
if (exists("harmfulness")) print(harmfulness)
This function retrieves media URLs for automated imaging images from the Nordic Microalgae API. These are images from automated imaging instruments (e.g., IFCB) used for image labeling purposes. It returns URLs for different renditions (large, medium, original, small) along with basic attribution.
get_nua_image_labeling_links(unparsed = FALSE)
unparsed |
Logical. If |
When unparsed = FALSE: a tibble with the following columns:
slug: The slug of the related taxon.
image_l_url: The URL for the "large" rendition.
image_o_url: The URL for the "original" rendition.
image_s_url: The URL for the "small" rendition.
image_m_url: The URL for the "medium" rendition.
contributor: The contributor of the media item.
copyright_holder: The copyright holder.
license: The license of the media item.
imaging_instrument: Comma-separated list of imaging instruments.
priority: The priority of the image.
https://nordicmicroalgae.org/ for Nordic Microalgae website.
https://nordicmicroalgae.org/api/ for Nordic Microalgae API documentation.
get_nua_image_labeling_metadata for retrieving full metadata for image labeling images.
get_nua_media_links for retrieving regular media image URLs.
# Retrieve image labeling media links
try(il_links <- get_nua_image_labeling_links(unparsed = FALSE))

# Preview the extracted data
if (exists("il_links")) head(il_links)
This function retrieves detailed metadata for automated imaging images from the Nordic Microalgae API. These are images from automated imaging instruments (e.g., IFCB) used for image labeling purposes. It returns comprehensive metadata including location, instrument, dataset, and taxonomic information.
get_nua_image_labeling_metadata(unparsed = FALSE)
unparsed |
Logical. If |
When unparsed = FALSE: a tibble with the following columns:
slug: The slug of the media item.
taxon_slug: The slug of the related taxon.
scientific_name: The scientific name of the related taxon.
file: The filename of the media item.
type: The MIME type of the media item.
title: The title of the media item.
caption: The caption of the media item.
license: The license of the media item.
location: The location where the media was collected.
contributor: The contributor of the media item.
copyright_holder: The copyright holder.
imaging_instrument: Comma-separated list of imaging instruments.
training_dataset: DOI or URL of the training dataset.
sampling_date: The date the sample was collected.
geographic_area: The geographic area of collection.
latitude_degree: The latitude in degrees.
longitude_degree: The longitude in degrees.
institute: Comma-separated list of institutes.
contributing_organisation: The contributing organisation.
priority: The priority of the image.
created_at: The creation timestamp.
updated_at: The last update timestamp.
https://nordicmicroalgae.org/ for Nordic Microalgae website.
https://nordicmicroalgae.org/api/ for Nordic Microalgae API documentation.
get_nua_image_labeling_links for retrieving image labeling media URLs.
get_nua_media_metadata for retrieving regular media metadata.
# Retrieve image labeling metadata
try(il_metadata <- get_nua_image_labeling_metadata(unparsed = FALSE))

# Preview the extracted data
if (exists("il_metadata")) head(il_metadata)
This function retrieves media information from the Nordic Microalgae API and extracts slugs and URLs for different renditions (large, original, small, medium) for each media item.
get_nua_media_links(unparsed = FALSE)
unparsed |
Logical. If |
When unparsed = FALSE: a tibble with the following columns:
slug: The slug of the related taxon.
l_url: The URL for the "large" rendition.
o_url: The URL for the "original" rendition.
s_url: The URL for the "small" rendition.
m_url: The URL for the "medium" rendition.
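With several renditions per media item, a typical follow-up step is to pick one preferred URL per row. The sketch below prefers the medium rendition and falls back to the small one. The mock table, its slugs, and its example.org URLs are invented, and whether the API ever returns missing renditions is an assumption.

```r
# Mock result using two of the documented column names
media <- data.frame(
  slug  = c("taxon-a", "taxon-b"),
  m_url = c("https://example.org/m/a.jpg", NA),
  s_url = c("https://example.org/s/a.jpg", "https://example.org/s/b.jpg"),
  stringsAsFactors = FALSE
)

# Use the medium rendition when present, otherwise the small one
media$preferred_url <- ifelse(is.na(media$m_url), media$s_url, media$m_url)

media[, c("slug", "preferred_url")]
```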
https://nordicmicroalgae.org/ for Nordic Microalgae website.
https://nordicmicroalgae.org/api/ for Nordic Microalgae API documentation.
# Retrieve media information
try(media_info <- get_nua_media_links(unparsed = FALSE))

# Preview the extracted data
if (exists("media_info")) head(media_info)
This function retrieves metadata for media items from the Nordic Microalgae API. It returns detailed attributes such as title, caption, location, sampling date, geographic coordinates, and contributor information.
get_nua_media_metadata(unparsed = FALSE)
unparsed |
Logical. If |
When unparsed = FALSE: a tibble with the following columns:
slug: The slug of the media item.
taxon_slug: The slug of the related taxon.
scientific_name: The scientific name of the related taxon.
file: The filename of the media item.
type: The MIME type of the media item.
title: The title of the media item.
caption: The caption of the media item.
license: The license of the media item.
location: The location where the media was collected.
contributor: The contributor of the media item.
photographer_artist: The photographer or artist.
copyright_holder: The copyright holder.
copyright_stamp: The copyright stamp.
galleries: Comma-separated list of galleries.
technique: The imaging technique used.
contrast_enhancement: The contrast enhancement method used.
preservation: The preservation method used.
stain: The stain used.
sampling_date: The date the sample was collected.
geographic_area: The geographic area of collection.
latitude_degree: The latitude in degrees.
longitude_degree: The longitude in degrees.
institute: Comma-separated list of institutes.
contributing_organisation: The contributing organisation.
created_at: The creation timestamp.
updated_at: The last update timestamp.
https://nordicmicroalgae.org/ for Nordic Microalgae website.
https://nordicmicroalgae.org/api/ for Nordic Microalgae API documentation.
get_nua_media_links for retrieving media image URLs.
# Retrieve media metadata
try(media_metadata <- get_nua_media_metadata(unparsed = FALSE))

# Preview the extracted data
if (exists("media_metadata")) head(media_metadata)
This function retrieves all taxonomic information for algae taxa from the Nordic Microalgae API. It fetches details including scientific names, authorities, ranks, and image URLs (in different sizes: large, medium, original, and small).
get_nua_taxa(unparsed = FALSE)
unparsed |
Logical. If |
When unparsed = FALSE: a tibble containing the following columns:
slug |
A unique identifier for the taxon. |
scientific_name |
The scientific name of the taxon. |
authority |
The authority associated with the scientific name. |
rank |
The taxonomic rank of the taxon. |
https://nordicmicroalgae.org/ for Nordic Microalgae website.
https://nordicmicroalgae.org/api/ for Nordic Microalgae API documentation.
# Retrieve and display taxa data
try(taxa_data <- get_nua_taxa(unparsed = FALSE))
if (exists("taxa_data")) head(taxa_data)
This function downloads the EG-Phyto (previously PEG) biovolume zip archive from ICES (using
cache_peg_zip()), unzips it, and reads the first Excel file by default.
You can also specify which file in the archive to read.
get_peg_list(
  file = NULL,
  sheet = NULL,
  force = FALSE,
  url = "https://www.ices.dk/data/Documents/ENV/PEG_BVOL.zip",
  clean_cache_days = 30,
  verbose = TRUE
)
file |
Character string specifying which file in the zip archive to read. Defaults to the first Excel file in the archive. |
sheet |
Character or numeric; the name or index of the sheet to read from the Excel file. If not specified, defaults to the first sheet. |
force |
Logical; if |
url |
Character string with the URL of the PEG zip file. Defaults to the official ICES link. |
clean_cache_days |
Numeric; if not |
verbose |
A logical indicating whether to print progress messages. Default is TRUE. |
A tibble with the contents of the requested Excel file.
clean_shark4r_cache() to manually clear cached files.
# Read the first Excel file from the PEG zip
try(peg_list <- get_peg_list())
if (exists("peg_list")) head(peg_list)
This function downloads the SHARK codes Excel file from SMHI (if not already cached) and reads it into R. The file is stored in a persistent cache directory so it does not need to be downloaded again in subsequent sessions.
get_shark_codes(
  url = "https://smhi.se/oceanografi/oce_info_data/shark_web/downloads/codelist_SMHI.xlsx",
  sheet = 1,
  skip = 1,
  force = FALSE,
  clean_cache_days = 30
)
url |
Character string with the URL to the SHARK codes Excel file. Defaults to the official SMHI codelist. |
sheet |
Sheet to read. Can be either the sheet name or its index
(default is |
skip |
Number of rows to skip before reading data
(default is |
force |
Logical; if |
clean_cache_days |
Numeric; if not |
A tibble containing the contents of the requested sheet.
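The download-once-then-reuse behaviour can be sketched as follows. This is a rough illustration under stated assumptions, not the package's code: `cached_file()` and `mock_download()` are hypothetical helpers, and the assumption that `clean_cache_days` removes cached copies older than that many days is inferred from the argument name only.

```r
# Hypothetical cache helper; `download` stands in for the real file download.
cached_file <- function(path, download, force = FALSE, clean_cache_days = 30) {
  # Assumed semantics: drop a cached copy older than `clean_cache_days`.
  if (file.exists(path) && !is.null(clean_cache_days)) {
    age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
    if (age_days > clean_cache_days) unlink(path)
  }
  # Download when forced or when no cached copy is available.
  if (force || !file.exists(path)) {
    download(path)
  }
  path
}

# Offline demonstration with a mock download that writes a small file.
downloads <- 0
mock_download <- function(path) {
  downloads <<- downloads + 1
  writeLines("code;description", path)
}

cache_path <- file.path(tempdir(), "codelist_demo.csv")
unlink(cache_path)

cached_file(cache_path, mock_download)                # downloads
cached_file(cache_path, mock_download)                # served from cache
cached_file(cache_path, mock_download, force = TRUE)  # downloads again
```

In this sketch the mock download runs twice: once for the initial call and once for the forced refresh.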
clean_shark4r_cache() to manually clear cached files.
# Read the first sheet, skipping the first row
try(codes <- get_shark_codes())
if (exists("codes")) head(codes)

# Force re-download of the Excel file
try(codes <- get_shark_codes(force = TRUE))
The get_shark_data() function retrieves tabular data from the SHARK database hosted by SMHI. The function sends a POST request
to the SHARK API with customizable filters, including year, month, taxon name, water category, and more, and returns the
retrieved data as a structured tibble. To view available filter options, see get_shark_options.
get_shark_data(
  tableView = "sharkweb_overview",
  headerLang = "internal_key",
  save_data = FALSE,
  file_path = NULL,
  delimiters = "point-tab",
  lineEnd = "win",
  encoding = "utf_8",
  dataTypes = c(),
  bounds = c(),
  fromYear = NULL,
  toYear = NULL,
  months = c(),
  parameters = c(),
  checkStatus = "",
  qualityFlags = c(),
  deliverers = c(),
  orderers = c(),
  projects = c(),
  datasets = c(),
  minSamplingDepth = "",
  maxSamplingDepth = "",
  redListedCategory = c(),
  taxonName = c(),
  stationName = c(),
  vattenDistrikt = c(),
  seaBasins = c(),
  counties = c(),
  municipalities = c(),
  waterCategories = c(),
  typOmraden = c(),
  helcomOspar = c(),
  seaAreas = c(),
  hideEmptyColumns = FALSE,
  row_limit = 10^7,
  prod = TRUE,
  utv = FALSE,
  verbose = TRUE
)
tableView |
Character. Specifies the columns of the table to retrieve. Options include:
Default is |
headerLang |
Character. Language option for column headers. Possible values:
|
save_data |
Logical. If |
file_path |
Character. The file path where the data should be saved. Required if |
delimiters |
Character. Specifies the delimiter used to separate values in the file, if |
lineEnd |
Character. Defines the type of line endings in the file, if |
encoding |
Character. Sets the file's text encoding, if |
dataTypes |
Character vector. Specifies data types to filter. Possible values include:
|
bounds |
A numeric vector of length 4 specifying the geographical search boundaries in decimal degrees,
formatted as |
fromYear |
Integer (optional). The starting year for data retrieval.
If set to |
toYear |
Integer (optional). The ending year for data retrieval.
If set to |
months |
Integer vector. The months to retrieve data for, e.g., |
parameters |
Character vector. Optional parameters to filter the results by, such as |
checkStatus |
Character string. Optional status check to filter results. |
qualityFlags |
Character vector. Specifies the quality flags to filter the data. By default, all data are included, including those with the "B" flag (Bad). |
deliverers |
Character vector. Specifies the data deliverers to filter by. |
orderers |
Character vector. Orderers to filter by specific organizations or individuals. |
projects |
Character vector. Projects to filter data by specific research or monitoring projects. |
datasets |
Character vector. Datasets to filter data by specific datasets. |
minSamplingDepth |
Numeric. Minimum sampling depth (in meters) to filter the data. |
maxSamplingDepth |
Numeric. Maximum sampling depth (in meters) to filter the data. |
redListedCategory |
Character vector. Red-listed taxa for conservation filtering. |
taxonName |
Character vector. Optional vector of taxa names to filter by. |
stationName |
Character vector. Station names to filter data by specific stations. |
vattenDistrikt |
Character vector. Water district names to filter by Swedish water districts. |
seaBasins |
Character vector. Sea basins to filter by. |
counties |
Character vector. Counties to filter by specific administrative regions. |
municipalities |
Character vector. Municipalities to filter by. |
waterCategories |
Character vector. Water categories to filter by. |
typOmraden |
Character vector. Type areas to filter by. |
helcomOspar |
Character vector. HELCOM or OSPAR areas for regional filtering. |
seaAreas |
Character vector. Sea area codes to filter by specific sea areas. |
hideEmptyColumns |
Logical. Whether to hide empty columns. Default is FALSE. |
row_limit |
Numeric. Specifies the maximum number of rows that can be retrieved in a single request.
If the requested data exceeds this limit, the function automatically downloads the data in yearly chunks
(ignored when |
prod |
Logical, whether to download from the production
( |
utv |
Logical. Select UTV server when |
verbose |
Logical. Whether to display progress information. Default is TRUE. |
This function sends a POST request to the SHARK API with the specified filters.
The API returns a delimited text file (e.g., tab- or semicolon-separated), which is
downloaded and read into R as a tibble. If the row_limit parameter is exceeded,
the data is retrieved in yearly chunks and combined into a single table. Adjusting the
row_limit parameter may be necessary when retrieving large datasets or detailed reports.
Note that making very large requests (e.g., retrieving the entire SHARK database)
can be extremely time- and memory-intensive.
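The fallback to yearly chunks can be sketched as follows. This is a minimal illustration rather than the package's implementation: `count_rows()` and `fetch_rows()` are hypothetical stand-ins for the API requests, and `mock_db` is invented data so that the example runs offline.

```r
# Mock "database": one row per sample, with a year column (invented data).
mock_db <- data.frame(
  year  = rep(2019:2021, times = c(4, 3, 5)),
  value = seq_len(12)
)

# Hypothetical stand-ins for the row-count and download requests.
count_rows <- function(from_year, to_year) {
  sum(mock_db$year >= from_year & mock_db$year <= to_year)
}
fetch_rows <- function(from_year, to_year) {
  mock_db[mock_db$year >= from_year & mock_db$year <= to_year, , drop = FALSE]
}

# Download in one request if the result fits within row_limit;
# otherwise issue one request per year and bind the pieces together.
fetch_chunked <- function(from_year, to_year, row_limit) {
  if (count_rows(from_year, to_year) <= row_limit) {
    return(fetch_rows(from_year, to_year))
  }
  chunks <- lapply(seq(from_year, to_year), function(y) fetch_rows(y, y))
  do.call(rbind, chunks)
}

all_at_once <- fetch_chunked(2019, 2021, row_limit = 100)  # single request
by_year     <- fetch_chunked(2019, 2021, row_limit = 5)    # three yearly requests
```

Both calls return the same 12 rows; only the number of requests differs.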
A tibble containing the retrieved SHARK data, parsed from
the API's delimited text response. Column types are inferred automatically.
For large queries spanning multiple years or including several data types, retrieval can be time-consuming and memory-intensive. Consider filtering by year, data type, or region for improved performance.
https://shark.smhi.se/en – SHARK database portal
get_shark_options() – Retrieve available filters
get_shark_table_counts() – Check table row counts before download
get_shark_datasets() – To download datasets as zip-archives
# Retrieve chlorophyll data from 2019 to 2020 for April to June
try(shark_data <- get_shark_data(fromYear = 2019,
                                 toYear = 2020,
                                 months = c(4, 5, 6),
                                 dataTypes = "Chlorophyll",
                                 verbose = FALSE))
if (exists("shark_data")) print(shark_data)
Downloads one or more datasets (zip-archives) from the SHARK database (Swedish national marine environmental data archive) and optionally unzips them. The function matches provided dataset names against all available SHARK datasets.
get_shark_datasets(
  dataset_name,
  save_dir = NULL,
  prod = TRUE,
  utv = FALSE,
  unzip_file = FALSE,
  return_df = FALSE,
  encoding = "latin_1",
  guess_encoding = TRUE,
  verbose = TRUE
)
dataset_name |
Character vector with one or more dataset
names (or partial names). Each entry will be matched against
available SHARK dataset identifiers (e.g.,
|
save_dir |
Directory where zip files (and optionally their
extracted contents) should be stored. Defaults to |
prod |
Logical, whether to download from the production
( |
utv |
Logical. Select UTV server when |
unzip_file |
Logical, whether to extract downloaded zip
archives ( |
return_df |
Logical, whether to return a combined data frame
with the contents of all downloaded datasets ( |
encoding |
Character. File encoding of |
guess_encoding |
Logical. If |
verbose |
Logical, whether to show download and extraction
progress messages. Default is |
If return_df = FALSE, a named list of character vectors.
Each element corresponds to one matched dataset and contains either
the path to the downloaded zip file (if unzip_file = FALSE) or
the path to the extraction directory (if unzip_file = TRUE).
If return_df = TRUE, a single combined data frame with all
dataset contents, including a source column indicating the dataset.
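Matching partial names against the available dataset identifiers can be pictured as below. This is a sketch under assumptions: `match_datasets()` is a hypothetical helper, literal substring matching is assumed (the package may match differently), and all identifiers except the first are invented placeholders.

```r
# Dataset identifiers; all but the first are invented placeholders.
available <- c(
  "SHARK_Phytoplankton_2023_SMHI_BVVF",
  "SHARK_Zooplankton_2022_AAA",
  "SHARK_Zooplankton_2022_BBB",
  "SHARK_Zooplankton_2021_AAA"
)

# Return every available identifier that contains any of the supplied names.
match_datasets <- function(dataset_name, available) {
  hits <- lapply(dataset_name, function(pattern) {
    available[grepl(pattern, available, fixed = TRUE)]
  })
  unique(unlist(hits))
}

matched <- match_datasets("Zooplankton_2022", available)
no_match <- match_datasets("Nothing", available)
```

Here `matched` holds the two 2022 zooplankton placeholders, while `no_match` is empty.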
https://shark.smhi.se/en for SHARK database.
get_shark_options() for listing available datasets.
get_shark_data() for downloading tabular data.
# Get a specific dataset
try(get_shark_datasets("SHARK_Phytoplankton_2023_SMHI_BVVF", verbose = FALSE))

# Get all Zooplankton datasets from 2022 and unzip them
try(get_shark_datasets(
  dataset_name = "Zooplankton_2022",
  unzip_file = TRUE,
  verbose = FALSE
))

# Get all Chlorophyll datasets and return as a combined data frame
try(combined_df <- get_shark_datasets(
  dataset_name = "Chlorophyll",
  return_df = TRUE,
  verbose = FALSE
))
if (exists("combined_df")) head(combined_df)
The get_shark_options() function retrieves available search options from the SHARK database.
It sends a GET request to the SHARK API and returns the results as a structured named list.
get_shark_options(prod = TRUE, utv = FALSE, unparsed = FALSE)
prod |
Logical value that selects the production server when |
utv |
Logical value that selects the UTV server when |
unparsed |
Logical. If |
This function sends a GET request to the /api/options endpoint of the SHARK API
to retrieve available search filters and options that can be used in SHARK data queries.
A named list of available search options from the SHARK API.
If unparsed = TRUE, returns the raw JSON structure as a list.
get_shark_data() for retrieving actual data from the SHARK API.
https://shark.smhi.se/en for the SHARK database portal.
# Retrieve available search options (simplified)
try(shark_options <- get_shark_options())
if (exists("shark_options")) names(shark_options)

# Retrieve full unparsed JSON response
try(raw_options <- get_shark_options(unparsed = TRUE))

# View available datatypes
if (exists("shark_options")) print(shark_options$dataTypes)
Downloads SHARK data for a given time period, filters to numeric parameters, and calculates descriptive statistics and Tukey outlier thresholds.
get_shark_statistics(
  fromYear = NULL,
  toYear = NULL,
  datatype = NULL,
  group_col = NULL,
  min_obs = 3,
  max_non_numeric_frac = 0.05,
  cache_result = FALSE,
  prod = TRUE,
  utv = FALSE,
  verbose = TRUE
)
fromYear |
Start year for download (numeric). Defaults to 5 years before the last complete year. |
toYear |
End year for download (numeric). Defaults to the last complete year. |
datatype |
Optional, one or more datatypes to filter on
(e.g. |
group_col |
Optional column name in the SHARK data to group by
(e.g. |
min_obs |
Minimum number of numeric observations required for a parameter to be included (default: 3). |
max_non_numeric_frac |
Maximum allowed fraction of non-numeric values for a parameter to be kept (default: 0.05). |
cache_result |
Logical, whether to save the result in a persistent cache
( |
prod |
Logical, whether to download from the production
( |
utv |
Logical. Select UTV server when |
verbose |
Logical, whether to show download progress messages. Default is |
By default, the function uses the previous five complete years. For example, if called in 2025 it will use data from 2020–2024.
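The default period can be reproduced with a few lines of base R. This is a sketch, assuming that the "last complete year" is simply the calendar year before the current date; `default_year_range()` is a hypothetical helper, not a package function.

```r
# Assumed logic: the last complete year is the year before `today`.
default_year_range <- function(today = Sys.Date(), n_years = 5) {
  current_year <- as.integer(format(today, "%Y"))
  to_year <- current_year - 1L
  from_year <- to_year - n_years + 1
  c(fromYear = from_year, toYear = to_year)
}

# Called with a date in 2025, this gives the 2020 to 2024 window.
range_2025 <- default_year_range(as.Date("2025-06-01"))
```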
A tibble with one row per parameter (and optionally per group) and the following columns:
Parameter name (character).
SHARK datatype (character).
Observed quantiles.
1st, 5th, 95th and 99th percentiles.
Interquartile range.
Arithmetic mean of numeric values.
Standard deviation of numeric values.
Variance of numeric values.
Coefficient of variation (sd / mean).
Median absolute deviation.
Lower/upper bounds for mild outliers (1.5 × IQR).
Lower/upper bounds for extreme outliers (3 × IQR).
Number of numeric observations used.
First year included in the SHARK data download (numeric).
Last year included in the SHARK data download (numeric).
Optional grouping column if provided.
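The Tukey thresholds listed above follow directly from the quartiles. The sketch below computes them in base R for an invented vector; the object names (`fences`, `mild_lower`, and so on) are illustrative and are not necessarily the column names returned by the function.

```r
# Invented example values; the last one is deliberately extreme.
x <- c(1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 2.6, 3.0, 9.5)

# Quartiles and interquartile range.
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey fences: 1.5 x IQR for mild outliers, 3 x IQR for extreme outliers.
fences <- c(
  mild_lower    = q[1] - 1.5 * iqr,
  mild_upper    = q[2] + 1.5 * iqr,
  extreme_lower = q[1] - 3 * iqr,
  extreme_upper = q[2] + 3 * iqr
)

cv <- sd(x) / mean(x)  # coefficient of variation
mad_x <- mad(x)        # median absolute deviation (scaled by 1.4826 by default)

# Values falling outside the fences.
mild_outliers    <- x[x < fences["mild_lower"] | x > fences["mild_upper"]]
extreme_outliers <- x[x < fences["extreme_lower"] | x > fences["extreme_upper"]]
```

A new observation can then be flagged by comparing it against the stored fences instead of recomputing the statistics.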
# Uses previous 5 years automatically, Chlorophyll data only
try(res <- get_shark_statistics(datatype = "Chlorophyll", verbose = FALSE))
if (exists("res")) print(res)

# Group by station name and save result in persistent cache
try(res_station <- get_shark_statistics(datatype = "Chlorophyll",
                                        group_col = "station_name",
                                        cache_result = TRUE,
                                        verbose = FALSE))
if (exists("res_station")) print(res_station)
The get_shark_table_counts() function retrieves the number of records (row counts)
from various SHARK data tables based on specified filters such as year, months,
data type, stations, and taxa. To view available filter options, see
get_shark_options.
get_shark_table_counts(
  tableView = "sharkweb_overview",
  fromYear = 2019,
  toYear = 2020,
  months = c(),
  dataTypes = c(),
  parameters = c(),
  orderers = c(),
  qualityFlags = c(),
  deliverers = c(),
  projects = c(),
  datasets = c(),
  minSamplingDepth = "",
  maxSamplingDepth = "",
  checkStatus = "",
  redListedCategory = c(),
  taxonName = c(),
  stationName = c(),
  vattenDistrikt = c(),
  seaBasins = c(),
  counties = c(),
  municipalities = c(),
  waterCategories = c(),
  typOmraden = c(),
  helcomOspar = c(),
  seaAreas = c(),
  prod = TRUE,
  utv = FALSE
)
tableView |
Character. Specifies the view of the table to retrieve. Options include:
Default is |
fromYear |
Integer. The starting year for the data to retrieve. Default is |
toYear |
Integer. The ending year for the data to retrieve. Default is |
months |
Integer vector. The months to retrieve data for (e.g., |
dataTypes |
Character vector. Specifies data types to filter, such as |
parameters |
Character vector. Optional. Parameters to filter results, such as |
orderers |
Character vector. Optional. Orderers to filter data by specific organizations. |
qualityFlags |
Character vector. Optional. Quality flags to filter data. |
deliverers |
Character vector. Optional. Deliverers to filter data by data providers. |
projects |
Character vector. Optional. Projects to filter data by specific research or monitoring projects. |
datasets |
Character vector. Optional. Datasets to filter data by specific dataset names. |
minSamplingDepth |
Numeric. Optional. Minimum depth (in meters) for sampling data. |
maxSamplingDepth |
Numeric. Optional. Maximum depth (in meters) for sampling data. |
checkStatus |
Character string. Optional. Status check to filter results. |
redListedCategory |
Character vector. Optional. Red-listed taxa for conservation filtering. |
taxonName |
Character vector. Optional. Taxa names for filtering specific species or taxa. |
stationName |
Character vector. Optional. Station names to retrieve data from specific stations. |
vattenDistrikt |
Character vector. Optional. Water district names to filter data by Swedish water districts. |
seaBasins |
Character vector. Optional. Sea basin names to filter data by different sea areas. |
counties |
Character vector. Optional. Counties to filter data within specific administrative regions in Sweden. |
municipalities |
Character vector. Optional. Municipalities to filter data within specific local regions. |
waterCategories |
Character vector. Optional. Water categories to filter data by. |
typOmraden |
Character vector. Optional. Type areas to filter data by specific areas. |
helcomOspar |
Character vector. Optional. HELCOM or OSPAR areas for regional filtering. |
seaAreas |
Character vector. Optional. Sea area codes for filtering by specific sea areas. |
prod |
Logical. Select production server when |
utv |
Logical. Select UTV server when |
An integer representing the total number of rows in the requested SHARK table after applying the specified filters.
https://shark.smhi.se/en for SHARK database.
get_shark_options to see filter options
get_shark_data to download SHARK data
# Retrieve the row count for chlorophyll data for April to June from 2019 to 2020
try(shark_data_counts <- get_shark_table_counts(fromYear = 2019,
                                                toYear = 2020,
                                                months = c(4, 5, 6),
                                                dataTypes = c("Chlorophyll")))
if (exists("shark_data_counts")) print(shark_data_counts)
This function collects data from the IOC-UNESCO Toxins Database and returns information about toxins.
get_toxin_list(return_count = FALSE, insecure = FALSE)
return_count |
Logical. If |
insecure |
Logical. If |
The TLS certificate for toxins.hais.ioc-unesco.org may occasionally lapse.
When this happens the default (secure) request fails with a certificate
error. The insecure
argument provides a deliberate, opt-in escape hatch: in an interactive
session the function prompts before retrying without verification, while a
non-interactive session aborts and instructs the caller to set
insecure = TRUE. Only disable verification when you have confirmed that the
failure is caused by the known certificate issue, as it removes protection
against a tampered or spoofed response.
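The decision described above can be summarised as a small function. This is a hypothetical sketch of the control flow only: `resolve_tls_failure()` and its `confirm` callback (standing in for the interactive prompt) are invented for illustration, and no request is made.

```r
# Hypothetical helper: decide what to do after a certificate error.
# `confirm` stands in for an interactive yes/no prompt.
resolve_tls_failure <- function(insecure,
                                is_interactive = interactive(),
                                confirm = function() FALSE) {
  # Explicit opt-in: retry without certificate verification.
  if (insecure) {
    return("retry_without_verification")
  }
  # Interactive session: ask before dropping verification.
  if (is_interactive) {
    if (isTRUE(confirm())) {
      return("retry_without_verification")
    }
    return("abort")
  }
  # Non-interactive session: abort and tell the caller how to opt in.
  stop("Certificate verification failed; set insecure = TRUE to bypass it.",
       call. = FALSE)
}

opted_in  <- resolve_tls_failure(insecure = TRUE, is_interactive = FALSE)
confirmed <- resolve_tls_failure(insecure = FALSE, is_interactive = TRUE,
                                 confirm = function() TRUE)
declined  <- resolve_tls_failure(insecure = FALSE, is_interactive = TRUE,
                                 confirm = function() FALSE)
```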
If return_count = TRUE, the function returns a numeric value representing the number of toxins in the database. Otherwise, it returns a tibble of toxins with detailed information.
https://toxins.hais.ioc-unesco.org/ for IOC-UNESCO Toxins Database.
# Retrieve the full list of toxins
try(toxin_list <- get_toxin_list())
if (exists("toxin_list")) head(toxin_list)

# Retrieve only the count of toxins
try(toxin_count <- get_toxin_list(return_count = TRUE))
if (exists("toxin_count")) print(toxin_count)

# If the server's TLS certificate has expired, the verification step can be
# bypassed explicitly. Only do this when the certificate issue is known and
# trusted, as it disables protection against tampering.
try(toxin_list <- get_toxin_list(insecure = TRUE))
if (exists("toxin_list")) head(toxin_list)
Retrieves the hierarchical taxonomy for one or more AphiaIDs from the World Register of Marine Species (WoRMS) and returns it in a wide format. Optionally, a hierarchy string column can be added that concatenates ranks.
get_worms_classification(
  aphia_ids,
  add_rank_to_hierarchy = FALSE,
  verbose = TRUE
)
aphia_ids |
Numeric vector of AphiaIDs to retrieve classification for. Must not be NULL or empty. Duplicates are allowed and will be preserved in the output. |
add_rank_to_hierarchy |
Logical (default FALSE). If TRUE, the hierarchy
string prepends rank names (e.g., |
verbose |
Logical (default TRUE). If TRUE, prints progress messages and a progress bar during data retrieval. |
The function performs the following steps:
Validates input AphiaIDs and removes NA values.
Retrieves the hierarchical classification for each AphiaID using worrms::wm_classification().
Converts the hierarchy to a wide format with one column per rank.
Adds a worms_hierarchy string concatenating all ranks.
Preserves input order and duplicates.
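The reshaping steps above can be sketched offline as follows. Everything here is illustrative: the IDs, rank contents, and `mock_classification` lookup are invented, `to_wide()` is a hypothetical helper, and the separator and bracketed rank format of the hierarchy string are assumptions rather than the package's documented output.

```r
# Invented long-format classifications keyed by a made-up ID.
mock_classification <- list(
  "1" = data.frame(rank = c("Kingdom", "Phylum"),
                   scientificname = c("KingdomA", "PhylumB"),
                   stringsAsFactors = FALSE),
  "2" = data.frame(rank = c("Kingdom", "Phylum", "Class"),
                   scientificname = c("KingdomA", "PhylumC", "ClassD"),
                   stringsAsFactors = FALSE)
)
ranks <- c("Kingdom", "Phylum", "Class")

to_wide <- function(id, add_rank_to_hierarchy = FALSE) {
  cls <- mock_classification[[as.character(id)]]
  # One column per rank; ranks absent from the hierarchy become NA.
  values <- cls$scientificname[match(ranks, cls$rank)]
  wide <- as.data.frame(as.list(setNames(values, ranks)),
                        stringsAsFactors = FALSE)
  # Hierarchy string; the separator and rank prefix are assumptions.
  parts <- if (add_rank_to_hierarchy) {
    paste0("[", cls$rank, "] ", cls$scientificname)
  } else {
    cls$scientificname
  }
  data.frame(
    aphia_id = id,
    scientific_name = cls$scientificname[nrow(cls)],
    wide,
    worms_hierarchy = paste(parts, collapse = " - "),
    stringsAsFactors = FALSE
  )
}

# Input order and duplicates are preserved: one output row per input ID.
ids <- c(1, 2, 1)
result <- do.call(rbind, lapply(ids, to_wide))
with_rank <- to_wide(2, add_rank_to_hierarchy = TRUE)
```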
A tibble where each row corresponds to an input AphiaID. Typical
columns include:
The AphiaID of the taxon (matches input).
The last scientific name in the hierarchy for this AphiaID.
Columns for each rank present in the WoRMS hierarchy (e.g., Kingdom, Phylum, Class, Order, Family, Genus, Species). Missing ranks are NA.
A concatenated string of all ranks for this
AphiaID. Added for every row if wm_classification() returned
hierarchy data. Format depends on add_rank_to_hierarchy.
wm_classification, https://marinespecies.org/
# Single AphiaID
try(single_taxa <- get_worms_classification(109604, verbose = FALSE))
if (exists("single_taxa")) print(single_taxa)

# Multiple AphiaIDs
try(multiple_taxa <- get_worms_classification(c(109604, 376667),
                                              verbose = FALSE))
if (exists("multiple_taxa")) print(multiple_taxa)

# Hierarchy with ranks in the string
try(with_rank <- get_worms_classification(c(109604, 376667),
                                          add_rank_to_hierarchy = TRUE,
                                          verbose = FALSE))

# Print hierarchy columns with ranks
if (exists("with_rank")) print(with_rank$worms_hierarchy[1])

# Compare with result when add_rank_to_hierarchy = FALSE
if (exists("multiple_taxa")) print(multiple_taxa$worms_hierarchy[1])
This function retrieves records from the WoRMS (World Register of Marine Species) database using the worrms R package for a given list of Aphia IDs.
If the retrieval fails, it retries a specified number of times before stopping.
get_worms_records(
  aphia_ids,
  max_retries = 3,
  sleep_time = 10,
  verbose = TRUE,
  aphia_id = deprecated()
)
aphia_ids |
A vector of Aphia IDs for which records should be retrieved. |
max_retries |
An integer specifying the maximum number of retry attempts for each Aphia ID in case of failure. Default is 3. |
sleep_time |
A numeric value specifying the time (in seconds) to wait between retry attempts. Default is 10 seconds. |
verbose |
A logical indicating whether to print progress messages. Default is TRUE. |
aphia_id |
The function attempts to fetch records for each Aphia ID in the provided vector. If a retrieval fails, it retries up to
the specified max_retries, with a pause of sleep_time seconds between attempts. If all retries fail for an Aphia ID, the function
stops with an error message.
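The retry behaviour can be illustrated with a generic wrapper. This is a sketch, not the package's code: `with_retries()` and `flaky_fetch()` are hypothetical, and `max_retries` is treated here as the total number of attempts, which is an assumption about how the function counts.

```r
# Generic retry wrapper; `fetch` is any function that may fail.
with_retries <- function(fetch, max_retries = 3, sleep_time = 10) {
  last_error <- NULL
  for (attempt in seq_len(max_retries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
    # Pause between attempts, but not after the final one.
    if (attempt < max_retries) {
      Sys.sleep(sleep_time)
    }
  }
  stop("All ", max_retries, " attempts failed: ",
       conditionMessage(last_error), call. = FALSE)
}

# Mock fetcher that fails twice before succeeding (no network needed).
calls <- 0
flaky_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  data.frame(AphiaID = 109604)
}

record <- with_retries(flaky_fetch, max_retries = 3, sleep_time = 0)
```

Setting `sleep_time = 0` keeps the demonstration fast; a real request loop would keep a pause so that the server is not hit repeatedly.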
A tibble containing the retrieved WoRMS records for the provided Aphia IDs. Each row corresponds to one Aphia ID.
https://marinespecies.org/ for WoRMS website.
https://CRAN.R-project.org/package=worrms
# Example usage with a vector of Aphia IDs
aphia_ids <- c(12345, 67890, 112233)
try(worms_records <- get_worms_records(aphia_ids, verbose = FALSE))
if (exists("worms_records")) print(worms_records)
Retrieves the hierarchical taxonomy for one or more AphiaIDs from the World Register of Marine Species (WoRMS). Optionally, the function can include all descendants of taxa at a specified rank and/or synonyms for all retrieved taxa.
get_worms_taxonomy_tree(
  aphia_ids,
  add_descendants = FALSE,
  add_descendants_rank = "Species",
  add_synonyms = FALSE,
  add_hierarchy = FALSE,
  add_rank_to_hierarchy = FALSE,
  verbose = TRUE
)
aphia_ids |
Numeric vector of AphiaIDs to retrieve taxonomy for. Must not be missing or all NA. |
add_descendants |
Logical (default FALSE). If TRUE, retrieves all
child taxa for each taxon at the rank specified by |
add_descendants_rank |
Character (default |
add_synonyms |
Logical (default FALSE). If TRUE, retrieves synonym records for all retrieved taxa and appends them to the dataset. |
add_hierarchy |
Logical (default FALSE). If TRUE, adds a |
add_rank_to_hierarchy |
Logical (default FALSE). If TRUE, the hierarchy
string prepends rank names (e.g., |
verbose |
Logical (default TRUE). If TRUE, prints progress messages and progress bars during data retrieval. |
The function performs the following steps:
Validates input AphiaIDs and removes NA values.
Retrieves the hierarchical classification for each AphiaID using worrms::wm_classification().
Optionally retrieves all descendants at the rank specified by add_descendants_rank if add_descendants = TRUE.
Optionally retrieves synonyms for all retrieved taxa if add_synonyms = TRUE.
Optionally adds a hierarchy column if add_hierarchy = TRUE.
Returns a combined, distinct dataset of all records.
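The final combining step amounts to stacking the retrieved tables and dropping repeated rows. A minimal base R sketch with invented records is shown below; real WoRMS records carry many more columns, and the package may use a different de-duplication routine.

```r
# Invented records for illustration only.
taxa <- data.frame(
  AphiaID = c(10, 11),
  scientificname = c("Genus", "Genus alpha"),
  status = c("accepted", "accepted"),
  stringsAsFactors = FALSE
)
descendants <- data.frame(
  AphiaID = c(11, 12),
  scientificname = c("Genus alpha", "Genus beta"),
  status = c("accepted", "accepted"),
  stringsAsFactors = FALSE
)
synonyms <- data.frame(
  AphiaID = 13,
  scientificname = "Genus gamma",
  status = "unaccepted",
  stringsAsFactors = FALSE
)

# Stack all tables, then keep each distinct record once.
combined <- unique(rbind(taxa, descendants, synonyms))
```

The record with AphiaID 11 appears in both `taxa` and `descendants` but only once in `combined`.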
A tibble containing detailed WoRMS records for all requested
AphiaIDs, including optional descendants and synonyms. Typical columns
include:
The AphiaID of the taxon.
The AphiaID of the parent taxon.
Scientific name of the taxon.
Taxonomic rank (e.g., Kingdom, Phylum, Genus, Species).
Taxonomic status (e.g., accepted, unaccepted).
AphiaID of the accepted taxon, if the record is a synonym.
Added only if a Species rank exists in the retrieved
data and if add_hierarchy = TRUE; otherwise not present.
Added only if a parentName rank exists in the retrieved
data and if add_hierarchy = TRUE; otherwise not present.
Added only if add_hierarchy = TRUE and hierarchy
data are available. Contains a concatenated string of the taxonomic
path.
Additional columns returned by WoRMS, including authorship and source information.
add_worms_taxonomy, construct_dyntaxa_table
wm_classification, wm_children, wm_synonyms
https://marinespecies.org/ for the WoRMS website.
# Retrieve hierarchy for a single AphiaID
try(get_worms_taxonomy_tree(aphia_ids = 109604, verbose = FALSE))
# Retrieve hierarchy including species-level descendants
try(get_worms_taxonomy_tree(
  aphia_ids = c(109604, 376667),
  add_descendants = TRUE,
  verbose = FALSE
))
# Retrieve hierarchy including hierarchy column
try(get_worms_taxonomy_tree(
  aphia_ids = c(109604, 376667),
  add_hierarchy = TRUE,
  verbose = FALSE
))
Checks whether the supplied scientific names exist in the Swedish taxonomic database Dyntaxa. Optionally, returns a data frame with taxon names, taxon IDs, and match status.
is_in_dyntaxa(taxon_names, subscription_key = Sys.getenv("DYNTAXA_KEY"), use_dwca = FALSE, return_df = FALSE, verbose = FALSE)
taxon_names |
Character vector of taxon names to check. |
subscription_key |
A Dyntaxa API subscription key. By default, the key
is read from the environment variable DYNTAXA_KEY. You can provide the key in three ways:
|
use_dwca |
Logical; if TRUE, uses the DwCA version of Dyntaxa instead of querying the API. |
return_df |
Logical; if TRUE, returns a data frame with columns |
verbose |
Logical; if TRUE, prints messages about unmatched taxa. |
A valid Dyntaxa API subscription key is required. You can request a free key for the "Taxonomy" service from the ArtDatabanken API portal: https://api-portal.artdatabanken.se/
If return_df = FALSE (default), a logical vector indicating whether each input
name was found in Dyntaxa. Returned invisibly if verbose = TRUE.
If return_df = TRUE, a data frame with columns:
taxon_name: original input names
taxon_id: corresponding Dyntaxa taxon IDs (NA if not found)
match: logical indicating presence in Dyntaxa
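To make the two return shapes concrete, the sketch below reproduces them against a small local lookup table instead of the Dyntaxa API. The reference table and its IDs are invented for illustration; only the output column names come from the description above.

```r
# Hypothetical local lookup table standing in for Dyntaxa (IDs are made up)
reference <- data.frame(
  name = c("Skeletonema marinoi", "Aurelia aurita"),
  id = c(1001L, 1002L),
  stringsAsFactors = FALSE
)

taxon_names <- c("Skeletonema marinoi", "Nonexistent species")

# match() gives the row index in the reference, or NA when a name is absent
idx <- match(taxon_names, reference$name)

# Shape of the default result: one logical per input name
found <- !is.na(idx)

# Shape of the return_df = TRUE result
result <- data.frame(
  taxon_name = taxon_names,
  taxon_id = reference$id[idx],
  match = found,
  stringsAsFactors = FALSE
)
print(result)
```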
## Not run:
# Using an environment variable (recommended for convenience)
Sys.setenv(DYNTAXA_KEY = "your_key_here")
is_in_dyntaxa(c("Skeletonema marinoi", "Nonexistent species"))
# Return a data frame instead of logical vector
is_in_dyntaxa(c("Skeletonema marinoi", "Nonexistent species"), return_df = TRUE)
# Or pass the key directly
is_in_dyntaxa("Skeletonema marinoi", subscription_key = "your_key_here")
# Suppress messages
is_in_dyntaxa("Skeletonema marinoi", verbose = FALSE)
## End(Not run)
This function downloads and sources the SHARK4R required and recommended field definitions directly from the SHARK4R-statistics GitHub repository.
The definitions are stored in an R script (fields.R) located in the fields/ folder of the repository.
The function sources this file directly from GitHub into the current R session.
load_shark4r_fields(verbose = TRUE)
verbose |
Logical; if |
The sourced script defines two main objects:
required_fields — vector or data frame of required SHARK fields.
recommended_fields — vector or data frame of recommended SHARK fields.
The output of this function can be directly supplied to the
check_fields function through its field_definitions argument
for validating SHARK4R data consistency.
If sourcing fails (e.g., due to a network issue or repository changes), the function throws an error with a descriptive message.
Invisibly returns a list with two elements:
Object containing required SHARK fields.
Object containing recommended SHARK fields.
check_fields for validating datasets using the loaded field definitions (as field_definitions).
load_shark4r_stats for loading precomputed SHARK4R statistics.
# Load SHARK4R field definitions from GitHub
try(fields <- load_shark4r_fields(verbose = FALSE))
# Access required or recommended fields for the first entry
if (exists("fields")) fields[[1]]$required
if (exists("fields")) fields[[1]]$recommended
## Not run:
# Use the loaded definitions in check_fields()
check_fields(my_data, field_definitions = fields)
## End(Not run)
This function downloads and loads precomputed SHARK4R statistical data
(e.g., threshold or summary statistics) directly from the
SHARK4R-statistics GitHub repository.
The data are stored as .rds files and read into R as objects.
load_shark4r_stats(file_name = "sea_basin.rds", verbose = TRUE)
file_name |
Character string specifying the name of the |
verbose |
Logical; if |
The function retrieves the file from the GitHub repository’s data/ folder.
It temporarily downloads the file to the local system and then reads it into R using readRDS().
If the download fails (e.g., due to a network issue or invalid filename), the function throws an error with a descriptive message.
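The download-then-read pattern described here can be sketched in base R as below. The helper name `read_remote_rds()` is hypothetical, and the demonstration uses a local file exposed through a `file://` URL (Unix-style path) so that it runs without network access; the package itself resolves a URL in the GitHub repository instead.

```r
# Hypothetical helper: download an .rds file to a temporary location and read it
read_remote_rds <- function(url) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp), add = TRUE)  # remove the temporary copy afterwards
  tryCatch(
    {
      utils::download.file(url, tmp, mode = "wb", quiet = TRUE)
      readRDS(tmp)
    },
    error = function(e) {
      stop("Could not load '", url, "': ", conditionMessage(e), call. = FALSE)
    }
  )
}

# Demonstration with a local file served through a file:// URL (no network needed)
src <- tempfile(fileext = ".rds")
saveRDS(data.frame(parameter = "Temperature", max = 25), src)
demo <- read_remote_rds(paste0("file://", normalizePath(src, winslash = "/")))
print(demo)
```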
An R object (typically a tibble or data.frame) read from the specified .rds file.
check_outliers for detecting threshold exceedances using the loaded statistics.
get_shark_statistics for generating and caching statistical summaries used in SHARK4R.
scatterplot for generating interactive plots with threshold values.
# Load the default SHARK4R statistics file
try(stats <- load_shark4r_stats(verbose = FALSE))
if (exists("stats")) print(stats)
# Load a specific file
try(thresholds <- load_shark4r_stats("scientific_name.rds", verbose = FALSE))
if (exists("thresholds")) print(thresholds)
Retrieves shore distance, environmental grids, and area values for given coordinates. Coordinates may be supplied either through a data frame or as separate numeric vectors.
lookup_xy(data = NULL, lon = NULL, lat = NULL, shoredistance = TRUE, grids = TRUE, areas = FALSE, as_data_frame = TRUE)
data |
Optional data frame containing coordinate columns. The expected names are
|
lon |
Optional numeric vector of longitudes. Must be supplied together with |
lat |
Optional numeric vector of latitudes. Must be supplied together with |
shoredistance |
Logical; if |
grids |
Logical; if |
areas |
Logical or numeric. When logical, |
as_data_frame |
Logical; if |
When both vector inputs and a data frame are provided, the vector inputs take precedence.
Coordinates are validated and cleaned before lookup, and only unique values are queried.
Queries are processed in batches to avoid overloading the remote service.
Area retrieval accepts either a logical flag or a radius. A radius of zero corresponds to requesting a single area value.
Final results are reordered to match the original input positions.
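The deduplicate, batch, and reorder behaviour described above can be sketched as follows. `fake_lookup()` is a stand-in for the remote service, and the helper name, the batch size, and the validity rule (finite values within longitude and latitude bounds) are assumptions for illustration rather than the package's actual implementation.

```r
# Stand-in for the remote service: returns one value per coordinate pair.
# A real lookup would query the xylookup web service instead.
fake_lookup <- function(lon, lat) round(lon + lat, 1)

lookup_unique_in_batches <- function(lon, lat, batch_size = 2) {
  key <- paste(lon, lat)
  valid <- !is.na(lon) & !is.na(lat) & abs(lon) <= 180 & abs(lat) <= 90
  unique_idx <- which(valid & !duplicated(key))

  # Split the unique, valid positions into batches and query each batch once
  batches <- split(unique_idx, ceiling(seq_along(unique_idx) / batch_size))
  values <- unlist(
    lapply(batches, function(i) fake_lookup(lon[i], lat[i])),
    use.names = FALSE
  )

  # Map results back to every input position; invalid inputs stay NA
  out <- rep(NA_real_, length(lon))
  out[valid] <- values[match(key[valid], key[unique_idx])]
  out
}

res <- lookup_unique_in_batches(
  lon = c(10.9, 18.3, 10.9, 200, 11.5, 12.0),
  lat = c(58.1, 58.3, 58.1, 58.0, NA, 57.0)
)
print(res)
```

Positions 1 and 3 are identical, so only three unique pairs are queried (in two batches), while the out-of-range longitude and the missing latitude yield NA.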
The function has been modified from the obistools package (Provoost and Bosch, 2024).
A data frame or list, depending on as_data_frame. Invalid coordinates produce
NA entries (data frame) or NULL elements (list). Duplicate input coordinates
return repeated results.
Provoost P, Bosch S (2024). “obistools: Tools for data enhancement and quality control” Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. R package version 0.1.0, https://iobis.github.io/obistools/.
check_onland, check_depth, https://iobis.github.io/xylookup/ – OBIS xylookup web service
# Using a data frame
df <- data.frame(sample_longitude_dd = c(10.9, 18.3), sample_latitude_dd = c(58.1, 58.3))
try(lookup_xy(df))
# Area search within a radius
try(lookup_xy(df, areas = 500))
# Using separate coordinate vectors
try(lookup_xy(lon = c(10.9, 18.3), lat = c(58.1, 58.3)))
This function searches the AlgaeBase API for genus information and returns detailed taxonomic data, including higher taxonomy, taxonomic status, scientific names, and other related metadata.
match_algaebase_genus(genus, subscription_key = Sys.getenv("ALGAEBASE_KEY"), higher = TRUE, unparsed = FALSE, newest_only = TRUE, exact_matches_only = TRUE, apikey = deprecated())
genus |
The genus name to search for (character string). This parameter is required. |
subscription_key |
A character string containing the API key for accessing the AlgaeBase API. By default, the key
is read from the environment variable ALGAEBASE_KEY. You can provide the key in three ways:
|
higher |
A boolean flag indicating whether to include higher taxonomy in the output (default is TRUE). |
unparsed |
A boolean flag indicating whether to return the raw JSON output from the API (default is FALSE). |
newest_only |
A boolean flag to return only the most recent entry (default is TRUE). |
exact_matches_only |
A boolean flag to limit results to exact matches (default is TRUE). |
apikey |
A valid API key is required and can be requested from the AlgaeBase team.
A tibble with the following columns:
id — AlgaeBase identifier.
accepted_name — Accepted scientific name (if different from the input).
input_name — The genus name supplied by the user.
input_match — Indicator of exact match (1 = exact, 0 = not exact).
currently_accepted — Indicator if the taxon is currently accepted (1 = TRUE, 0 = FALSE).
genus_only — Indicator if the search was for a genus only (1 = genus, 0 = genus + species).
kingdom, phylum, class, order, family — Higher taxonomy (returned if higher = TRUE).
taxonomic_status — Status of the taxon (e.g., currently accepted, synonym, unverified).
taxon_rank — Taxonomic rank of the accepted name (e.g., genus, species).
mod_date — Date when the entry was last modified.
long_name — Full scientific name including author and date (if available).
authorship — Author information (if available).
https://www.algaebase.org/ for AlgaeBase website.
## Not run:
match_algaebase_genus("Anabaena", subscription_key = "your_api_key")
## End(Not run)
This function searches the AlgaeBase API for species based on genus and species names. It allows for flexible search parameters such as filtering by exact matches, returning the most recent results, and including higher taxonomy details.
match_algaebase_species(genus, species, subscription_key = Sys.getenv("ALGAEBASE_KEY"), higher = TRUE, unparsed = FALSE, newest_only = TRUE, exact_matches_only = TRUE, apikey = deprecated())
genus |
A character string specifying the genus name. |
species |
A character string specifying the species or specific epithet. |
subscription_key |
A character string containing the API key for accessing the AlgaeBase API. By default, the key
is read from the environment variable ALGAEBASE_KEY. You can provide the key in three ways:
|
higher |
A logical value indicating whether to include higher taxonomy details (default is |
unparsed |
A logical value indicating whether to print the full JSON response from the API (default is |
newest_only |
A logical value indicating whether to return only the most recent entries (default is |
exact_matches_only |
A logical value indicating whether to return only exact matches (default is |
apikey |
A valid API key is required and can be requested from the AlgaeBase team.
This function queries the AlgaeBase API for species based on the genus and species names, and filters the results based on various parameters. The function handles different taxonomic ranks and formats the output for easy use. It can merge higher taxonomy data if requested.
A tibble with details about the species, including:
taxonomic_status — The current status of the taxon (e.g., accepted, synonym, unverified).
taxon_rank — The rank of the taxon (e.g., species, genus).
accepted_name — The currently accepted scientific name, if applicable.
authorship — Author information for the scientific name (if available).
mod_date — Date when the taxonomic record was last modified.
... — Other relevant information returned by the data source.
https://www.algaebase.org/ for AlgaeBase website.
## Not run:
# Search for a species with exact matches only, return the most recent results
result <- match_algaebase_species(
  genus = "Skeletonema", species = "marinoi", subscription_key = "your_api_key"
)
# Print result
print(result)
## End(Not run)
This function queries the AlgaeBase API to retrieve taxonomic information for a list of algae names based on genus and (optionally) species. It supports exact matching, genus-only searches, and retrieval of higher taxonomic ranks.
match_algaebase_taxa(genera, species, subscription_key = Sys.getenv("ALGAEBASE_KEY"), genus_only = FALSE, higher = TRUE, unparsed = FALSE, exact_matches_only = TRUE, sleep_time = 1, newest_only = TRUE, verbose = TRUE, apikey = deprecated(), genus = deprecated())
genera |
A character vector of genus names. |
species |
A character vector of species names corresponding to the |
subscription_key |
A character string containing the API key for accessing the AlgaeBase API. By default, the key
is read from the environment variable ALGAEBASE_KEY.
|
genus_only |
Logical. If |
higher |
Logical. If |
unparsed |
Logical. If |
exact_matches_only |
Logical. If |
sleep_time |
Numeric. The delay (in seconds) between consecutive AlgaeBase API queries. Defaults to |
newest_only |
A logical value indicating whether to return only the most recent entries (default is |
verbose |
Logical. If |
apikey |
|
genus |
A valid API key is required and can be requested from the AlgaeBase team.
Scientific names can be parsed using the parse_scientific_names() function before being processed by match_algaebase_taxa().
Each unique genus-species combination is queried only once, so duplicate inputs add no extra API calls. Genus-only searches are performed when genus_only = TRUE or when the species name is missing or invalid.
If an API query fails, the function does not stop; it returns a row with NA values for the missing or unavailable data.
The function allows for integration with data analysis workflows that require resolving or verifying taxonomic names against AlgaeBase.
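The query-once-per-combination behaviour and the NA rows on failure can be sketched as below. `fake_query()` stands in for a single AlgaeBase request and fails on purpose for one input; the function names, the ID values and the matching key are assumptions for illustration only.

```r
# Stand-in for a single API query; fails for one genus to mimic a request error
fake_query <- function(genus, species) {
  if (genus == "Broken") stop("simulated API failure")
  data.frame(genus = genus, species = species,
             id = nchar(paste0(genus, species)), stringsAsFactors = FALSE)
}

input <- data.frame(
  genus = c("Skeletonema", "Tripos", "Skeletonema", "Broken"),
  species = c("marinoi", "furca", "marinoi", "taxon"),
  stringsAsFactors = FALSE
)

# Query every unique combination exactly once
combos <- unique(input)
calls <- 0
results <- lapply(seq_len(nrow(combos)), function(i) {
  calls <<- calls + 1
  tryCatch(
    fake_query(combos$genus[i], combos$species[i]),
    # On failure, keep the row but fill the result with NA
    error = function(e) {
      data.frame(genus = combos$genus[i], species = combos$species[i],
                 id = NA_integer_, stringsAsFactors = FALSE)
    }
  )
})
lookup <- do.call(rbind, results)

# Join back so that every input row, including duplicates, gets a result
matched <- lookup[match(paste(input$genus, input$species),
                        paste(lookup$genus, lookup$species)), ]
rownames(matched) <- NULL
print(matched)
```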
A tibble containing taxonomic information for each input genus–species combination.
The following columns may be included:
id — AlgaeBase ID (if available).
kingdom, phylum, class, order, family — Higher taxonomy (returned if higher = TRUE).
genus, species, infrasp — Genus, species, and infraspecies names (if applicable).
taxonomic_status — Status of the name (e.g., accepted, synonym, unverified).
currently_accepted — Logical indicator whether the name is currently accepted (TRUE/FALSE).
accepted_name — Currently accepted name if different from the input name.
input_name — The name supplied by the user.
input_match — Indicator of exact match (1 = exact, 0 = not exact).
taxon_rank — Taxonomic rank of the accepted name (e.g., genus, species).
mod_date — Date when the entry was last modified in AlgaeBase.
long_name — Full species name with authorship and date.
authorship — Author(s) associated with the species name.
https://www.algaebase.org/ for AlgaeBase website.
parse_scientific_names for parsing taxonomic names before passing them to the function.
## Not run:
# Example with genus and species vectors
genus_vec <- c("Thalassiosira", "Skeletonema", "Tripos")
species_vec <- c("pseudonana", "costatum", "furca")
algaebase_results <- match_algaebase_taxa(
  genera = genus_vec,
  species = species_vec,
  subscription_key = "your_api_key",
  exact_matches_only = TRUE,
  verbose = TRUE
)
head(algaebase_results)
## End(Not run)
This function matches a list of taxon names against the SLU Artdatabanken API (Dyntaxa) and retrieves the best matches along with their taxon IDs.
match_dyntaxa_taxa(taxon_names, subscription_key = Sys.getenv("DYNTAXA_KEY"), multiple_options = FALSE, searchFields = "Both", isRecommended = "NotSet", isOkForObservationSystems = "NotSet", culture = "sv_SE", page = 1, pageSize = 100, verbose = TRUE)
taxon_names |
A vector of taxon names to match. |
subscription_key |
A Dyntaxa API subscription key. By default, the key
is read from the environment variable DYNTAXA_KEY. You can provide the key in three ways:
|
multiple_options |
Logical. If TRUE, the function will return multiple matching names. Default is FALSE, selecting the first match. |
searchFields |
A character string indicating the search fields. Defaults to 'Both'. |
isRecommended |
A character string indicating whether the taxon is recommended. Defaults to 'NotSet'. |
isOkForObservationSystems |
A character string indicating whether the taxon is suitable for observation systems. Defaults to 'NotSet'. |
culture |
A character string indicating the culture. Defaults to 'sv_SE'. |
page |
An integer specifying the page number for pagination. Defaults to 1. |
pageSize |
An integer specifying the page size for pagination. Defaults to 100. |
verbose |
Logical. If TRUE, prints a progress bar. Default is TRUE. |
A valid Dyntaxa API subscription key is required. You can request a free key for the "Taxonomy" service from the ArtDatabanken API portal: https://api-portal.artdatabanken.se/
Note: Please review the API conditions
and register for access before using the API. Data collected through the API
is stored at SLU Artdatabanken. Please also note that the authors of SHARK4R are not affiliated with SLU Artdatabanken.
A tibble containing the search pattern, taxon ID, and best match for each taxon name.
SLU Artdatabanken API Documentation
## Not run:
# Match taxon names against SLU Artdatabanken API
matched_taxa <- match_dyntaxa_taxa(c("Homo sapiens", "Canis lupus"), "your_subscription_key")
print(matched_taxa)
## End(Not run)
Matches reported station names in your dataset against a curated station list
("station.txt"), which is synced with "Stationsregistret":
https://stationsregister.miljodatasamverkan.se/.
match_station(names, station_file = NULL, try_synonyms = TRUE, verbose = TRUE)
names |
Character vector of station names to match. |
station_file |
Optional path to a custom station file (tab-delimited).
If |
try_synonyms |
Logical; if |
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
This function is useful for validating station names and identifying any unmatched or misspelled entries.
If try_synonyms = TRUE, unmatched station names are also compared
against the SYNONYM_NAMES column in the station database, splitting
multiple synonyms separated by <or>.
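The synonym fallback can be sketched as follows. The station table is a made-up extract, the helper name is hypothetical, and the output column names (`station`, `found`) and exact-string matching are assumptions; only the SYNONYM_NAMES column and the `<or>` separator come from the description above.

```r
# Hypothetical extract of a station register; synonyms are illustrative
station_db <- data.frame(
  STATION_NAME = c("ANHOLT E", "BY5 BORNHOLMSDJ"),
  SYNONYM_NAMES = c("ANHOLT EAST<or>ANH E", NA),
  stringsAsFactors = FALSE
)

match_with_synonyms <- function(names, db, try_synonyms = TRUE) {
  found <- names %in% db$STATION_NAME
  if (try_synonyms) {
    # Split "A<or>B" into individual synonyms and drop missing entries
    synonyms <- unlist(strsplit(db$SYNONYM_NAMES, "<or>", fixed = TRUE))
    synonyms <- trimws(synonyms[!is.na(synonyms)])
    found <- found | names %in% synonyms
  }
  data.frame(station = names, found = found, stringsAsFactors = FALSE)
}

reported <- c("ANHOLT E", "ANH E", "STX999")
with_syn <- match_with_synonyms(reported, station_db)
without_syn <- match_with_synonyms(reported, station_db, try_synonyms = FALSE)
print(with_syn)
```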
The function first checks if a station file path is provided via the
station_file argument. If not, it looks for the
NODC_CONFIG environment variable. This variable can point to a folder
where the NODC (Swedish National Oceanographic Data Center) configuration and station file
are stored, typically including:
<NODC_CONFIG>/config/station.txt
If NODC_CONFIG is set and the folder exists, the function will use
station.txt from that location. Otherwise, it falls back to the
bundled station.zip included in the SHARK4R package.
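The lookup order for the station file can be sketched like this. The helper `resolve_station_file()` is hypothetical, the `config/station.txt` sub-path follows the layout shown above, and the bundled fallback is represented by a plain file name rather than a real package path.

```r
# Hypothetical helper mirroring the lookup order: argument, NODC_CONFIG, bundled file
resolve_station_file <- function(station_file = NULL, bundled = "station.zip") {
  if (!is.null(station_file)) return(station_file)
  config_dir <- Sys.getenv("NODC_CONFIG", unset = "")
  if (nzchar(config_dir) && dir.exists(config_dir)) {
    return(file.path(config_dir, "config", "station.txt"))
  }
  bundled
}

# Demonstration with a temporary folder; the previous setting is restored afterwards
old <- Sys.getenv("NODC_CONFIG", unset = NA)
cfg <- file.path(tempdir(), "nodc_demo")
dir.create(cfg, showWarnings = FALSE)

Sys.setenv(NODC_CONFIG = cfg)
from_env <- resolve_station_file()

Sys.unsetenv("NODC_CONFIG")
fallback <- resolve_station_file()
explicit <- resolve_station_file("my_stations.txt")

if (!is.na(old)) Sys.setenv(NODC_CONFIG = old)
print(c(from_env, fallback, explicit))
```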
A data frame with two columns:
The input station names.
Logical; TRUE if the station was found in the SMHI station list (including synonyms if enabled), otherwise FALSE.
# Example stations
stations <- c("ANHOLT E", "BY5 BORNHOLMSDJ", "STX999")
# Check if station names are in station.txt (including synonyms)
match_station(stations, try_synonyms = TRUE, verbose = FALSE)
This function retrieves records from the WoRMS database using the worrms R package for a vector of taxonomic names.
It includes retry logic to handle temporary failures and ensures all names are processed. The function can query
all names at once using a bulk API call or iterate over names individually.
match_worms_taxa(taxa_names, fuzzy = TRUE, best_match_only = TRUE, max_retries = 3, sleep_time = 10, marine_only = TRUE, bulk = FALSE, chunk_size = 500, verbose = TRUE)
taxa_names |
A character vector of taxonomic names for which to retrieve records. |
fuzzy |
A logical value indicating whether to perform a fuzzy search. Default is TRUE.
Note: Fuzzy search is only applied in iterative mode (bulk = FALSE). |
best_match_only |
A logical value indicating whether to automatically select the first match and return a single match. Default is TRUE. |
max_retries |
Integer specifying the maximum number of retries for the request in case of failure. Default is 3. |
sleep_time |
Numeric specifying the number of seconds to wait before retrying a failed request. Default is 10. |
marine_only |
Logical indicating whether to restrict results to marine taxa only. Default is TRUE. |
bulk |
Logical indicating whether to perform a bulk API call for all unique names at once. Default is FALSE. |
chunk_size |
Integer specifying the maximum number of taxa per bulk API request. Default is 500.
Only used when bulk = TRUE. |
verbose |
Logical indicating whether to print progress messages. Default is TRUE. |
If bulk = TRUE, all unique names are sent to the API in bulk requests of at most chunk_size names each, and fuzzy matching is ignored.
If bulk = FALSE, the function iterates over names individually, optionally using fuzzy matching.
The function retries failed requests up to max_retries times, pausing for sleep_time seconds between attempts.
Names for which no records are found will have status = "no content" and AphiaID = NA.
Before being passed to the API, names are cleaned by converting them to UTF-8, replacing problematic symbols with spaces, removing trailing periods, collapsing repeated spaces, and trimming leading and trailing whitespace.
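The name cleaning and the retry behaviour can each be sketched in a few lines of base R. Both helpers are hypothetical: the set of "problematic symbols" is an assumption (the text does not list them), and `with_retry()` is a generic wrapper, not the package's internal code.

```r
# Simplified name cleaning following the steps described above.
# The character class of "problematic symbols" is an assumption.
clean_taxon_names <- function(x) {
  x <- enc2utf8(x)
  x <- gsub("[/#?&]", " ", x)   # replace problematic symbols with spaces
  x <- gsub("\\s+", " ", x)     # collapse runs of whitespace
  x <- trimws(x)
  x <- gsub("\\.+$", "", x)     # remove trailing periods
  trimws(x)
}

# Generic retry wrapper: call f() up to max_retries times, pausing between attempts
with_retry <- function(f, max_retries = 3, sleep_time = 10) {
  for (attempt in seq_len(max_retries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < max_retries) Sys.sleep(sleep_time)
  }
  stop("All ", max_retries, " attempts failed: ", conditionMessage(result), call. = FALSE)
}

cleaned <- clean_taxon_names(c("  Karenia   mikimotoi. ", "Amphidinium/sp"))
print(cleaned)

# A function that fails twice before succeeding
attempts <- 0
flaky <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) stop("temporary failure")
  "ok"
}
value <- with_retry(flaky, max_retries = 3, sleep_time = 0)
print(value)
```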
A tibble containing the retrieved WoRMS records. Each row corresponds to a record for a taxonomic name.
Repeated taxa in the input are preserved in the output.
https://marinespecies.org/ for WoRMS website.
https://CRAN.R-project.org/package=worrms
# Retrieve WoRMS records iteratively for two taxonomic names
try(records <- match_worms_taxa(c("Amphidinium", "Karenia"),
                                max_retries = 3, sleep_time = 5,
                                marine_only = TRUE, verbose = FALSE))
if (exists("records")) print(records)
# Retrieve WoRMS records in bulk mode (faster for many names)
try(records_bulk <- match_worms_taxa(c("Amphidinium", "Karenia", "Navicula"),
                                     bulk = TRUE, marine_only = TRUE, verbose = FALSE))
This function processes a character vector of scientific names, splitting them into genus and species components. It handles binomial names (e.g., "Homo sapiens"), removes undesired descriptors (e.g., 'Cfr.', 'cf.', 'sp.', 'spp.'), and manages cases involving varieties, subspecies, or invalid species names. Special characters and whitespace are handled appropriately.
parse_scientific_names(scientific_names, remove_undesired_descriptors = TRUE, remove_subspecies = TRUE, remove_invalid_species = TRUE, encoding = "UTF-8", scientific_name = deprecated())
scientific_names |
A character vector containing scientific names, which may include binomials, additional descriptors, or varieties. |
remove_undesired_descriptors |
Logical, if TRUE, undesired descriptors (e.g., 'Cfr.', 'cf.', 'colony', 'cells', etc.) are removed. Default is TRUE. |
remove_subspecies |
Logical, if TRUE, subspecies/variety descriptors (e.g., 'var.', 'subsp.', 'f.', etc.) are removed. Default is TRUE. |
remove_invalid_species |
Logical, if TRUE, invalid species names (e.g., 'sp.', 'spp.') are removed. Default is TRUE. |
encoding |
A string specifying the encoding to be used for the input names (e.g., 'UTF-8'). Default is 'UTF-8'. |
scientific_name |
A tibble with two columns:
genus — Genus names.
species — Species names (empty if unavailable or invalid).
Invalid descriptors such as "sp.", "spp.", and numeric entries
are excluded from this column.
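The kind of splitting described here can be illustrated with a deliberately simplified sketch. The helper `split_scientific_name()` is hypothetical and handles only a handful of descriptors (`cf.`, `Cfr.`, `var.`, `subsp.`, `f.`, `sp.`, `spp.`); the package function covers more cases and may differ in detail.

```r
# Deliberately simplified splitter; the descriptor lists are assumptions
# and far shorter than what a production parser needs.
split_scientific_name <- function(x) {
  # Drop uncertainty descriptors such as "cf." or "Cfr."
  x <- gsub("\\b(cf|cfr)\\.\\s*", "", x, ignore.case = TRUE, perl = TRUE)
  # Drop infraspecific parts such as "var. subarctica"
  x <- gsub("\\s+(var|subsp|f)\\.\\s+.*$", "", x, perl = TRUE)
  words <- strsplit(trimws(x), "\\s+")
  genus <- vapply(words, function(w) if (length(w) >= 1) w[1] else "", character(1))
  species <- vapply(words, function(w) if (length(w) >= 2) w[2] else "", character(1))
  # Blank out invalid epithets: "sp.", "spp." and purely numeric entries
  species[species %in% c("sp.", "spp.") | grepl("^[0-9]+$", species)] <- ""
  data.frame(genus = genus, species = species, stringsAsFactors = FALSE)
}

parsed <- split_scientific_name(c(
  "Skeletonema marinoi",
  "Cf. Azadinium perforatum",
  "Gymnodinium sp.",
  "Melosira varians",
  "Aulacoseira islandica var. subarctica"
))
print(parsed)
```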
https://www.algaebase.org/ for AlgaeBase website.
# Example with a vector of scientific names
scientific_names <- c("Skeletonema marinoi", "Cf. Azadinium perforatum", "Gymnodinium sp.", "Melosira varians", "Aulacoseira islandica var. subarctica")
# Parse names
result <- parse_scientific_names(scientific_names)
# Check the resulting data
print(result)
Generates an interactive map using the leaflet package, plotting sampling
stations from a data frame. The function automatically detects column names
for station, longitude, and latitude, supporting both standard and
delivery-style datasets.
plot_map_leaflet(data, provider = "CartoDB.Positron")
data |
A data frame containing station coordinates and names. The function accepts either:
|
provider |
Character. The tile provider to use for the map background.
See available providers at
https://leaflet-extras.github.io/leaflet-providers/preview/.
Defaults to |
An HTML widget object (leaflet map) that can be printed or displayed
in R Markdown or Shiny applications.
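Automatic column detection of this kind can be sketched as below. The candidate names are taken from the two example data sets in this manual (standard and delivery style); the helper `detect_columns()` and its error message are hypothetical, and the package may recognise additional names.

```r
# Candidate column names per role, taken from the two example formats
candidates <- list(
  station = c("station_name", "STATN"),
  lon = c("sample_longitude_dd", "LONGI"),
  lat = c("sample_latitude_dd", "LATIT")
)

# Hypothetical helper: pick the first candidate present in the data for each role
detect_columns <- function(data, candidates) {
  picked <- vapply(candidates, function(opts) {
    hit <- intersect(opts, names(data))
    if (length(hit) == 0) NA_character_ else hit[1]
  }, character(1))
  if (anyNA(picked)) {
    stop("Missing column(s) for: ",
         paste(names(picked)[is.na(picked)], collapse = ", "), call. = FALSE)
  }
  picked
}

df_deliv <- data.frame(
  STATN = c("Station A", "Station B"),
  LONGI = c(10.0, 10.5),
  LATIT = c(59.0, 59.5)
)
cols <- detect_columns(df_deliv, candidates)
print(cols)
```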
# Example data
df <- data.frame(
  station_name = c("Station A", "Station B"),
  sample_longitude_dd = c(10.0, 10.5),
  sample_latitude_dd = c(59.0, 59.5)
)
# Plot points on map
map <- plot_map_leaflet(df)
# Example data in SHARK delivery format
df_deliv <- data.frame(
  STATN = c("Station A", "Station B"),
  LONGI = c(10.0, 10.5),
  LATIT = c(59.0, 59.5)
)
# Plot points on map
map_deliv <- plot_map_leaflet(df_deliv)
This function is a wrapper/re-export of
iRfcb::ifcb_is_near_land(). The iRfcb package is only required
if you want to actually call this function.
Determines whether given positions are near land based on a land polygon shape file.
positions_are_near_land(latitudes, longitudes, distance = 500, shape = NULL, source = "obis", crs = 4326, remove_small_islands = TRUE, small_island_threshold = 2e+06, plot = FALSE, verbose = TRUE)
latitudes |
Numeric vector of latitudes for positions. |
longitudes |
Numeric vector of longitudes for positions. Must be the same length as |
distance |
Buffer distance (in meters) from the coastline to consider "near land." Default is 500 meters. |
shape |
Optional path to a shapefile ( |
source |
Character string indicating which default coastline source to use when |
crs |
Coordinate reference system (CRS) to use for input and output. Default is EPSG code 4326 (WGS84). |
remove_small_islands |
Logical indicating whether to remove small islands from
the coastline. Useful in archipelagos. Default is |
small_island_threshold |
Area threshold in square meters below which islands
will be considered small and removed, if remove_small_islands is set to |
plot |
A boolean indicating whether to plot the points, land polygon and buffer. Default is |
verbose |
A logical indicating whether to print progress messages. Default is TRUE. |
This function calculates a buffered area around the coastline using a polygon shapefile and determines if each input position intersects with this buffer or the landmass itself. By default, it uses the OBIS land vector dataset.
Coastline sources used when shape = NULL:
"obis" - the land polygon distributed by the
Ocean Biodiversity Information System, downloaded from
https://obis-resources.s3.amazonaws.com/land.gpkg.
The first call downloads and caches the file; the cache can be
cleared with clean_shark4r_cache().
"ne" - Natural Earth 1:10m coastline / land vectors,
provided via the rnaturalearth package; see
https://www.naturalearthdata.com.
"eea" - high-resolution European coastline from the
European Environment Agency (EEA Coastline 2017). Downloaded
in chunks from the EEA ArcGIS service on first use and cached
locally. Dataset metadata:
https://sdi.eea.europa.eu/catalogue/datahub/api/records/9faa6ea1-372a-4826-a3c7-fb5b05e31c52/formatters/xsl-view?output=pdf&language=eng&approved=true.
If plot = FALSE (default), a logical vector is returned indicating whether each position
is near land or not, with NA for positions where coordinates are missing.
If plot = TRUE, a ggplot object is returned showing the land polygon, buffer area,
and position points colored by their proximity to land.
clean_shark4r_cache() to manually clear cached shape files.
iRfcb::ifcb_is_near_land for the original function.
    # Define coordinates
    latitudes <- c(62.500353, 58.964498, 57.638725, 56.575338)
    longitudes <- c(17.845993, 20.394418, 18.284523, 16.227174)

    # Call the function
    try(near_land <- positions_are_near_land(latitudes, longitudes, distance = 300, crs = 4326))

    # Print the result
    if (exists("near_land")) print(near_land)
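Because positions_are_near_land() returns a logical vector with NA for missing coordinates, a common follow-up step is filtering a sample table by that vector. The sketch below is one possible way to do this in base R. The helper filter_offshore(), the sample table, and the flag values are hypothetical and only stand in for real output.

```r
# Hypothetical helper: keep rows flagged as NOT near land.
# Rows with an NA flag (missing coordinates) are dropped unless
# keep_unknown = TRUE.
filter_offshore <- function(data, near_land, keep_unknown = FALSE) {
  stopifnot(is.logical(near_land), length(near_land) == nrow(data))
  keep <- !near_land
  keep[is.na(keep)] <- keep_unknown
  data[keep, , drop = FALSE]
}

# Hypothetical sample table; the third position lacks a latitude
samples <- data.frame(
  station = c("A", "B", "C", "D"),
  lat = c(62.500353, 58.964498, NA, 56.575338),
  lon = c(17.845993, 20.394418, 18.284523, 16.227174)
)

# Illustrative flags only, NOT real output. In practice they would come from:
# flags <- positions_are_near_land(samples$lat, samples$lon, distance = 300)
flags <- c(TRUE, FALSE, NA, FALSE)

offshore <- filter_offshore(samples, flags)
print(offshore)
```

Dropping NA flags by default is a design choice for this sketch; set keep_unknown = TRUE to retain positions that could not be evaluated.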
This function reads a sample file exported as an Excel (.xlsx) file from Plankton Toolbox and extracts data from a specified sheet. The default sheet is "sample_data.txt", which contains count data.
read_ptbx(file_path, sheet = c("sample_data.txt", "sample_info.txt", "counting_method.txt", "Sample summary", "README"))
file_path |
Character. Path to the Excel file. |
sheet |
Character. The name of the sheet to read. Must be one of: "sample_data.txt", "Sample summary", "sample_info.txt", "counting_method.txt", or "README". Default is "sample_data.txt". |
A tibble containing the contents of the selected sheet.
https://nordicmicroalgae.org/plankton-toolbox/ for downloading Plankton Toolbox.
https://github.com/planktontoolbox/plankton-toolbox/ for Plankton Toolbox source code.
    # Read the default data sheet
    sample_data <- read_ptbx(system.file("extdata/Anholt_E_2024-09-15_0-10m.xlsx", package = "SHARK4R"))

    # Print output
    sample_data

    # Read a specific sheet
    sample_info <- read_ptbx(system.file("extdata/Anholt_E_2024-09-15_0-10m.xlsx", package = "SHARK4R"), sheet = "sample_info.txt")

    # Print output
    sample_info
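The usage signature of read_ptbx() lists the allowed sheet names as a vector default, which is the pattern that base R's match.arg() resolves. Whether read_ptbx() uses match.arg() internally is an assumption; the sketch below only illustrates how such a default behaves, using a hypothetical helper named resolve_sheet().

```r
# Hypothetical helper mirroring the 'sheet' argument of read_ptbx()
resolve_sheet <- function(sheet = c("sample_data.txt", "sample_info.txt",
                                    "counting_method.txt", "Sample summary",
                                    "README")) {
  # match.arg() picks the first choice when the argument is left at its
  # default and raises an error for values outside the allowed set
  match.arg(sheet)
}

default_sheet <- resolve_sheet()
chosen_sheet <- resolve_sheet("Sample summary")

# An unknown sheet name fails early with an informative error
bad_sheet <- tryCatch(resolve_sheet("no_such_sheet"),
                      error = function(e) "error")
```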
Reads tab- or semicolon-delimited SHARK export files with standardized format.
The function can handle plain text files (.txt) or zip archives (.zip) containing
a file named shark_data.txt. It automatically detects and converts column types
and can optionally coerce the "value" column to numeric. The "sample_date" column
is converted to Date if it exists.
read_shark(filename, delimiters = "point-tab", encoding = "utf_8", guess_encoding = TRUE, value_numeric = TRUE)
filename |
Path to the SHARK export file. Can be a |
delimiters |
Character. Specifies the delimiter used in the file. Options:
|
encoding |
Character. File encoding. Options: |
guess_encoding |
Logical. If |
value_numeric |
Logical. If |
This function is robust to file encoding issues. By default (guess_encoding = TRUE),
it attempts to automatically detect the file encoding and will use it if it differs
from the user-specified encoding. Automatic detection can be disabled.
A data frame containing the parsed contents of the SHARK export file,
or NULL if the file is empty or could not be read.
read_shark_deliv() for reading SHARK Excel delivery files (.xls/.xlsx).
    ## Not run:
    # Read a plain text SHARK export
    df_txt <- read_shark("sharkweb_data.txt")

    # Read a SHARK export from a zip archive
    df_zip <- read_shark("shark_data.zip")

    # Read with explicit encoding and do not convert value
    df_custom <- read_shark("shark_data.txt", encoding = "latin_1", guess_encoding = FALSE, value_numeric = FALSE)
    ## End(Not run)
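To make the type conversions described for read_shark() concrete, the following base R sketch writes a tiny tab-delimited file and applies the same kind of coercion: "sample_date" to Date and "value" to numeric. This is not the implementation of read_shark(); the file contents are invented for illustration, and the NA produced for a non-numeric entry is base R behaviour, which read_shark() may or may not share.

```r
# Write a small, invented tab-delimited file to a temporary location
path <- tempfile(fileext = ".txt")
writeLines(c(
  "station_name\tsample_date\tparameter\tvalue",
  "Anholt E\t2024-09-15\tChlorophyll-a\t1.8",
  "Anholt E\t2024-09-15\tSalinity\t<20"
), path)

# Read every column as character first, so nothing is converted silently
raw <- utils::read.delim(path, sep = "\t", dec = ".",
                         colClasses = "character",
                         fileEncoding = "UTF-8")

# Convert the date column to Date
raw$sample_date <- as.Date(raw$sample_date)

# Coerce 'value' to numeric; in base R, entries such as "<20" become NA
raw$value <- suppressWarnings(as.numeric(raw$value))

str(raw)
```

Entries such as "<20" are one reason to read with value_numeric = FALSE when the original text of the "value" column needs to be preserved.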
Reads Excel files delivered to SHARK in a standardized format.
The function automatically detects whether the file is .xls or .xlsx
and reads the specified sheet, skipping a configurable number of rows.
Column types are automatically converted, and if a column "SDATE" exists,
it is converted to Date.
read_shark_deliv(filename, skip = 2, sheet = 2)
filename |
Path to the Excel file to be read. |
skip |
Minimum number of rows to skip before reading anything (column names or data).
Leading empty rows are automatically skipped, so this is a lower bound.
Ignored if |
sheet |
Sheet to read. Either a string (sheet name) or integer (sheet index). If neither is specified, defaults to the second sheet. |
A data frame containing the parsed contents of the Excel file, or NULL if the file
does not exist, is empty, or cannot be read.
read_shark() for reading SHARK tab- or semicolon-delimited export files or zip-archives.
    ## Not run:
    # Read the second sheet of a .xlsx file (default)
    df_xlsx <- read_shark_deliv("shark_delivery.xlsx")

    # Read the first sheet of a .xls file, skipping 3 rows
    df_xls <- read_shark_deliv("shark_delivery.xls", skip = 3, sheet = 1)
    ## End(Not run)
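Since read_shark_deliv() returns NULL rather than raising an error when a file is missing, empty, or unreadable, scripts may want to turn that NULL into an explicit failure. The sketch below shows one way to do so. The wrapper read_or_stop() and the stand-in fake_reader() are hypothetical, and the reader is passed in as an argument so the example runs without SHARK4R installed.

```r
# Hypothetical wrapper: stop with a clear message when a reader returns NULL
read_or_stop <- function(path, reader) {
  out <- reader(path)
  if (is.null(out)) {
    stop("Could not read file: ", path, call. = FALSE)
  }
  out
}

# Stand-in reader for illustration only: returns NULL for a missing file,
# imitating the documented return value of read_shark_deliv()
fake_reader <- function(path) {
  if (!file.exists(path)) return(NULL)
  data.frame(SDATE = as.Date("2024-09-15"))
}

# An existing (empty placeholder) file is "read" successfully
tmp <- tempfile(fileext = ".xlsx")
file.create(tmp)
ok <- read_or_stop(tmp, fake_reader)

# A missing file now produces an error instead of a silent NULL
failed <- tryCatch(
  read_or_stop(file.path(tempdir(), "no_such_file.xlsx"), fake_reader),
  error = function(e) conditionMessage(e)
)

# With SHARK4R installed, the real reader could be passed instead:
# df <- read_or_stop("shark_delivery.xlsx", SHARK4R::read_shark_deliv)
```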
This function launches the interactive Shiny application for performing quality control (QC) on SHARK data. The application provides a graphical interface for exploring and validating data before or after submission to SHARK.
run_qc_app(interactive = TRUE)
interactive |
Logical. Whether the session is interactive. |
The function checks that all required packages for the app are installed before launching. If any are missing, the user is notified. In interactive sessions, the function will prompt whether the missing packages should be installed automatically. In non-interactive sessions (e.g. scripts or CI), the function instead raises an error and lists the missing packages so they can be installed manually.
This function is called for its side effect of launching a Shiny application. It does not return a value.
    # Launch the SHARK4R Bio-QC Tool
    if (interactive()) {
      run_qc_app()
    }
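The dependency check described for run_qc_app() can be sketched in base R as follows. This is a minimal illustration of the described behaviour, not the package's actual code: the helper check_required() and the package names passed to it are hypothetical, and the real list of packages required by the app is not given here.

```r
# Hypothetical helper: verify that packages are installed before launching
check_required <- function(pkgs, interactive = base::interactive()) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!installed]
  if (length(absent) == 0) {
    return(invisible(TRUE))
  }
  msg <- paste("Missing packages:", paste(absent, collapse = ", "))
  if (!interactive) {
    # Non-interactive sessions (scripts, CI): fail and list what is missing
    stop(msg, call. = FALSE)
  }
  # Interactive sessions: ask before installing anything
  if (isTRUE(utils::askYesNo(paste0(msg, ". Install them now?")))) {
    utils::install.packages(absent)
    return(invisible(TRUE))
  }
  invisible(FALSE)
}

# Packages shipped with R are always available
all_present <- check_required(c("stats", "utils"), interactive = FALSE)

# A package that does not exist triggers the non-interactive error path
err <- tryCatch(
  check_required("aPackageThatDoesNotExist123", interactive = FALSE),
  error = function(e) conditionMessage(e)
)
```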
This function creates a scatterplot from a data frame, optionally coloring points
by a grouping column and adding horizontal threshold lines. Supports both static
ggplot2 plots and interactive plotly plots with a linear/log toggle.
scatterplot(data, x = c("station_name", "sample_date"), parameter = NULL, hline = NULL, hline_group_col = NULL, hline_value_col = NULL, hline_style = list(linetype = "dashed", size = 0.8), max_hlines = 5, interactive = TRUE, verbose = TRUE)
data |
A data.frame or tibble containing at least the following columns:
|
x |
Character. The column to use for the x-axis. Either |
parameter |
Optional character. If provided, only data for this parameter will be plotted.
If |
hline |
Numeric or data.frame. Horizontal line(s) to add. If numeric, a single line
is drawn at that y-value. If a data.frame, must contain |
hline_group_col |
Character. Column used for grouping when |
hline_value_col |
Character. Column in |
hline_style |
List. Appearance settings for horizontal lines. Should contain |
max_hlines |
Integer. Maximum number of horizontal line groups to display per parameter when |
interactive |
Logical. If TRUE, returns an interactive |
verbose |
Logical. If TRUE, messages will be displayed during execution. Defaults to TRUE. |
If hline is numeric, a single horizontal line is drawn across the plot.
If hline is a data.frame, only the first max_hlines groups (sorted alphabetically) are displayed.
Points can be colored by hline_group_col if provided.
Interactive plots include buttons to switch between linear and log y-axis scales.
A ggplot object (if interactive = FALSE) or a plotly object (if interactive = TRUE).
load_shark4r_stats for loading threshold or summary statistics that
can be used to define horizontal lines in the plot.
    ## Not run:
    scatterplot(
      data = my_data,
      x = "station_name",
      parameter = "Chlorophyll-a",
      hline = c(10, 20)
    )

    scatterplot(
      data = my_data,
      x = "sample_date",
      parameter = "Bacterial abundance",
      hline = thresholds_df,
      hline_group_col = "location_sea_basin",
      hline_value_col = "P99"
    )
    ## End(Not run)
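When hline is a data frame, scatterplot() is documented to display only the first max_hlines groups, sorted alphabetically. The base R sketch below shows which rows of a threshold table that rule would select. The threshold values are invented for illustration, and the selection code approximates the documented rule rather than reproducing the function's internals.

```r
# Invented threshold table using the column names from the example above
thresholds_df <- data.frame(
  location_sea_basin = c("Kattegat", "Skagerrak", "Bothnian Sea", "Arkona Basin"),
  P99 = c(12.5, 9.8, 7.1, 15.2)
)

max_hlines <- 2

# Sort the group names alphabetically and keep the first max_hlines of them
groups <- sort(unique(thresholds_df$location_sea_basin))
shown_groups <- head(groups, max_hlines)

# Rows of the threshold table that would be drawn as horizontal lines
shown <- thresholds_df[thresholds_df$location_sea_basin %in% shown_groups, ]
print(shown)
```

To control which groups appear, filter the threshold table to the groups of interest before passing it as hline, or raise max_hlines.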
Converts user-facing datatype names (e.g., "Grey seal") to internal SHARK4R names
(e.g., "GreySeal") based on SHARK4R:::.type_lookup. See available user-facing
datatypes in get_shark_options()$dataTypes.
translate_shark_datatype(x)
x |
Character vector of datatype names to translate |
Character vector of translated datatype names
    # Example strings
    datatypes <- c("Grey seal", "Primary production", "Physical and Chemical")

    # Basic translation
    translate_shark_datatype(datatypes)
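A lookup-based translation of this kind can be sketched with a named character vector. The table below contains only the one pair stated above ("Grey seal" to "GreySeal"); the helper translate_with_lookup() is hypothetical, and passing unknown names through unchanged is an assumption of this sketch, not documented behaviour of translate_shark_datatype().

```r
# Minimal lookup table: names are user-facing labels, values are internal names
type_lookup <- c("Grey seal" = "GreySeal")

# Hypothetical helper: translate known names, pass unknown names through
translate_with_lookup <- function(x, lookup) {
  out <- unname(lookup[x])
  unknown <- is.na(out)
  out[unknown] <- x[unknown]
  out
}

translated <- translate_with_lookup(c("Grey seal", "Some other type"), type_lookup)
print(translated)
```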
This function updates Dyntaxa taxonomy records based on a list of Dyntaxa taxon IDs. It collects parent IDs from the SLU Artdatabanken API (Dyntaxa), retrieves full taxonomy records, and organizes the data into a full taxonomic table that can be joined with data downloaded from SHARK.
update_dyntaxa_taxonomy(dyntaxa_ids, subscription_key = Sys.getenv("DYNTAXA_KEY"), add_missing_taxa = FALSE, verbose = TRUE)
dyntaxa_ids |
A vector of Dyntaxa taxon IDs to update. |
subscription_key |
A Dyntaxa API subscription key. By default, the key
is read from the environment variable DYNTAXA_KEY. You can provide the key in three ways:
|
add_missing_taxa |
Logical. If TRUE, the function will attempt to fetch missing taxa (i.e., taxon_ids not found in the initial Dyntaxa DwC-A query). Default is FALSE. |
verbose |
Logical. Print progress messages. Default is TRUE. |
A valid Dyntaxa API subscription key is required. You can request a free key for the "Taxonomy" service from the ArtDatabanken API portal: https://api-portal.artdatabanken.se/
Note: Please review the API conditions
and register for access before using the API. Data collected through the API
is stored at SLU Artdatabanken. Please also note that the authors of SHARK4R are not affiliated with SLU Artdatabanken.
A tibble representing the updated Dyntaxa taxonomy table.
get_shark_data, update_worms_taxonomy, SLU Artdatabanken API Documentation
    ## Not run:
    # Update Dyntaxa taxonomy for taxon IDs 238366 and 1010380
    updated_taxonomy <- update_dyntaxa_taxonomy(c(238366, 1010380), "your_subscription_key")
    print(updated_taxonomy)
    ## End(Not run)
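Because the subscription_key argument defaults to Sys.getenv("DYNTAXA_KEY"), one way to supply the key is to set that environment variable for the current session and check that it is non-empty before calling the API. The sketch below uses a placeholder value and leaves the actual API call commented out, since it needs a valid key and network access.

```r
# Placeholder only: replace with your own key and avoid committing real keys
Sys.setenv(DYNTAXA_KEY = "your_subscription_key")

# This is the value the default argument would pick up
key <- Sys.getenv("DYNTAXA_KEY")

# Fail early with a clear message if no key is available
if (!nzchar(key)) {
  stop("DYNTAXA_KEY is not set; request a key from the ArtDatabanken API portal.",
       call. = FALSE)
}

# With a valid key set, the key argument can then be omitted:
# updated_taxonomy <- SHARK4R::update_dyntaxa_taxonomy(c(238366, 1010380))
```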
This function is a wrapper/re-export of
iRfcb::ifcb_which_basin(). The iRfcb package is only required
if you want to actually call this function.
This function identifies which sub-basin each point in a set of latitude and longitude coordinates belongs to, using a user-specified or default shapefile.
The default shapefile includes the Baltic Sea, Kattegat, and Skagerrak basins and is included in the iRfcb package.
which_basin(latitudes, longitudes, plot = FALSE, shape_file = NULL)
latitudes |
A numeric vector of latitude points. |
longitudes |
A numeric vector of longitude points. |
plot |
A boolean indicating whether to plot the points along with the sea basins. Default is FALSE. |
shape_file |
The absolute path to a custom polygon shapefile in WGS84 (EPSG:4326) that represents the sea basin.
Defaults to the Baltic Sea, Kattegat, and Skagerrak basins included in the |
This function reads a pre-packaged shapefile of the Baltic Sea, Kattegat, and Skagerrak basins from the iRfcb package by default, or a user-supplied
shapefile if provided. The shapefiles originate from SHARK (https://shark.smhi.se/en/). It sets the CRS, transforms the CRS to WGS84 (EPSG:4326) if necessary, and checks if the given points
fall within the specified sea basin. Optionally, it plots the points and the sea basin polygons together.
A vector indicating the basin each point belongs to, or a ggplot object if plot = TRUE.
iRfcb::ifcb_which_basin for the original function.
    # Define example latitude and longitude vectors
    latitudes <- c(55.337, 54.729, 56.311, 57.975)
    longitudes <- c(12.674, 14.643, 12.237, 10.637)

    # Check which Baltic Sea basin each point is in
    points_in_the_baltic <- which_basin(latitudes, longitudes)
    print(points_in_the_baltic)

    # Plot the points and the basins
    map <- which_basin(latitudes, longitudes, plot = TRUE)