Integrating species distribution data from 7 source datasets requires resolving taxonomic identities across multiple naming systems. The MST uses a multi-authority matching pipeline to ensure each species is uniquely identified, enabling accurate merging of models from different data providers.
Taxonomic Authorities
The following authorities are loaded into a reference database (spp.duckdb) for taxonomic reconciliation:
ID Resolution Cascade
Species identifiers are resolved through a cascading lookup process:
- ITIS TSN match: if the source dataset provides an ITIS Taxonomic Serial Number, use the ITIS-to-WoRMS crosswalk for direct matching
- WoRMS crosswalk: look up accepted WoRMS AphiaID via the GBIF backbone, which integrates WoRMS as the marine taxonomy source
- Scientific name match: for records without matching identifiers, attempt exact scientific name matching against WoRMS accepted names
- API lookup: for remaining unresolved names, query the WoRMS REST API (wm_records_name()) for fuzzy matching
At each stage, deprecated names (synonyms) are resolved to their accepted names, and the taxonomicStatus field is used to determine the preferred name.
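The cascade can be sketched as a first-match-wins lookup. The Python below is an illustration only, not the msens implementation: the dictionaries stand in for tables in spp.duckdb, the record keys (itis_tsn, gbif_key, scientific_name) and all identifier values are made-up placeholders, and api_lookup is a hypothetical callable standing in for the WoRMS API query.

```python
# Hypothetical stand-ins for reference tables; all keys and IDs are placeholders.
ITIS_TO_WORMS = {"tsn-001": 1001}          # ITIS TSN -> WoRMS AphiaID
GBIF_TO_WORMS = {"gbif-002": 1002}         # GBIF backbone key -> WoRMS AphiaID
WORMS_ACCEPTED_NAMES = {"Genus gamma": 1003}  # accepted scientific name -> AphiaID


def resolve_aphia_id(record, api_lookup=None):
    """Return (aphia_id, stage) from the first stage that matches.

    `record` is a dict that may contain 'itis_tsn', 'gbif_key' and
    'scientific_name' (assumed field names). `api_lookup` is an optional
    callable taking a name and returning an AphiaID or None.
    """
    # Stage 1: direct ITIS TSN crosswalk.
    tsn = record.get("itis_tsn")
    if tsn in ITIS_TO_WORMS:
        return ITIS_TO_WORMS[tsn], "itis_tsn"

    # Stage 2: crosswalk via the GBIF backbone.
    gbif_key = record.get("gbif_key")
    if gbif_key in GBIF_TO_WORMS:
        return GBIF_TO_WORMS[gbif_key], "gbif_crosswalk"

    # Stage 3: exact match on the accepted scientific name.
    name = record.get("scientific_name")
    if name in WORMS_ACCEPTED_NAMES:
        return WORMS_ACCEPTED_NAMES[name], "exact_name"

    # Stage 4: fall back to a fuzzy API lookup for whatever is left.
    if api_lookup is not None and name:
        hit = api_lookup(name)
        if hit is not None:
            return hit, "api_fuzzy"

    return None, "unresolved"
```

Returning the stage label alongside the identifier makes it possible to audit how many taxa were resolved at each step.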
Species Categories
Valid species are classified into 7 categories based on their taxonomic position:
Data Quality
Duplicate Resolution
Multiple source datasets may provide models for the same species. The taxonomic matching step identifies duplicates by resolving all source names to canonical WoRMS or BirdLife identifiers. When a species appears in multiple datasets, the model merging pipeline (see Chapter 6) takes the MAX value across all sources rather than treating them as separate species.
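As a rough illustration of the MAX rule, the Python sketch below collapses rows from several sources into one value per taxon and grid cell. It assumes the maximum is taken cell by cell and that rows arrive as (taxon_id, cell_id, value) tuples after taxonomic matching; both are assumptions made for this example, and Chapter 6 describes the actual merge.

```python
def merge_models(rows):
    """Collapse (taxon_id, cell_id, value) rows to the max value per taxon and cell.

    Rows for the same canonical taxon_id may come from different source
    datasets; after merging, each (taxon_id, cell_id) pair appears once.
    """
    merged = {}
    for taxon_id, cell_id, value in rows:
        key = (taxon_id, cell_id)
        # Keep the larger value when two sources cover the same cell.
        if key not in merged or value > merged[key]:
            merged[key] = value
    return merged


# Made-up example: taxon "A" is modeled by two sources in cell 1.
example_rows = [
    ("A", 1, 20),   # source 1
    ("A", 1, 55),   # source 2, same taxon and cell
    ("A", 2, 10),
    ("B", 1, 5),
]
example_merged = merge_models(example_rows)
```

Because the key is the canonical identifier rather than the source name, two datasets that use different synonyms for the same species still merge into a single record.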
Synonym Handling
Taxonomic names change over time as species are reclassified. The pipeline handles this by:
- resolving all names through acceptedNameUsageID in each authority
- tracking the original source name alongside the accepted name
- flagging deprecated WoRMS IDs and updating them to current accepted IDs
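The three steps above can be sketched as follows. This Python example assumes a simple mapping from each taxon ID to its acceptedNameUsageID; the function names, the output field names, and the IDs are hypothetical, and the cycle guard is a defensive choice for this sketch rather than something the source describes.

```python
def resolve_accepted(taxon_id, accepted_of):
    """Follow acceptedNameUsageID pointers until an accepted ID is reached.

    `accepted_of` maps a taxon ID to its acceptedNameUsageID. An ID that is
    absent from the mapping, or that points to itself, is treated as accepted.
    """
    seen = set()
    current = taxon_id
    while current in accepted_of and accepted_of[current] != current:
        if current in seen:
            raise ValueError(f"cycle in synonym chain at {current}")
        seen.add(current)
        current = accepted_of[current]
    return current


def reconcile(source_name, source_id, accepted_of):
    """Keep the original name and ID alongside the accepted ID, and flag changes."""
    accepted_id = resolve_accepted(source_id, accepted_of)
    return {
        "source_name": source_name,
        "source_id": source_id,
        "accepted_id": accepted_id,
        "was_deprecated": accepted_id != source_id,
    }


# Made-up chain: 10 was renamed to 20, which was later renamed to 30.
example_accepted_of = {10: 20, 20: 30, 30: 30}
example_record = reconcile("Oldgenus oldname", 10, example_accepted_of)
```

Following the chain to its end matters when a name has been revised more than once, since a single lookup would stop at an intermediate synonym.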
Valid Species Filter (is_ok)
The source datasets together contain roughly 17,561 taxa — dominated by the ~17,000 marine species modeled globally by AquaMaps, plus the regulatory and range-map datasets. Not all are appropriate for the sensitivity analysis. A taxon is flagged valid (is_ok = TRUE) only when it meets every one of the following criteria:
- Taxonomically accepted & marine: a resolved WoRMS/BirdLife identity with an accepted (or alternative-representation) status, classified as marine (WoRMS isMarine, or BirdLife seabird), and not extinct (WoRMS isExtinct or IUCN Red List EX); non-turtle reptiles (e.g., sea snakes) are excluded.
- Has a mapped distribution: the merged model resolves to at least one grid cell with a value — a few stale, empty model records are excluded so the valid count matches the species actually mapped.
- Expert range intersects the US EEZ: where an IUCN expert range map exists, it must overlap US waters. Species whose global range falls entirely outside the US EEZ are dropped even if AquaMaps predicts suitable habitat there, since those are edge-of-range model artifacts not supported by expert assessment.
Program-Area overlap is not a validity criterion. A species is valid if it has a modeled distribution anywhere in the study area; whether it falls within one of the BOEM Program Areas is determined later by a spatial (zone) query, not by the is_ok flag. This was introduced in v7 — earlier versions silently required Program-Area overlap, which conflated “valid” with “occurs in a Program Area” and hid biodiversity outside the current program cycle (see below).
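A minimal sketch of the flag as a predicate, in Python, may make the three criteria easier to check against each other. The Taxon fields, the status strings, and the in_program_area attribute are assumptions made for this illustration; the real pipeline derives these values from the reference database and the merged models.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Assumed status strings for "accepted (or alternative-representation)".
OK_STATUSES = {"accepted", "alternative representation"}


@dataclass(frozen=True)
class Taxon:
    accepted_id: Optional[int]          # resolved WoRMS/BirdLife identity, if any
    status: str
    is_marine: bool
    is_extinct: bool
    is_non_turtle_reptile: bool
    n_cells: int                        # grid cells with a value in the merged model
    iucn_range_overlaps_eez: Optional[bool]  # None when no IUCN range map exists
    in_program_area: bool = False       # spatial result, deliberately unused by is_ok


def is_ok(t: Taxon) -> bool:
    """True only when every validity criterion holds."""
    taxonomic = (
        t.accepted_id is not None
        and t.status in OK_STATUSES
        and t.is_marine
        and not t.is_extinct
        and not t.is_non_turtle_reptile
    )
    mapped = t.n_cells > 0
    # The range test applies only where an expert range map exists.
    range_ok = t.iucn_range_overlaps_eez is not False
    return taxonomic and mapped and range_ok


example_taxon = Taxon(
    accepted_id=1001, status="accepted", is_marine=True, is_extinct=False,
    is_non_turtle_reptile=False, n_cells=42, iucn_range_overlaps_eez=None,
)
```

Keeping in_program_area out of the predicate mirrors the v7 change: Program-Area membership is answered by a separate spatial query over the valid set.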
This filter currently yields 16,153 valid species across the full study area (the US EEZ), of which 9,230 have modeled distributions within the 20 BOEM Program Areas of the current program cycle. These figures are pulled live from the /stats.json endpoint for the current database version (v7).
The definition of “valid species” has tightened over time, so the headline count is not directly comparable across versions:
- v5 (~9,819) — valid after taxonomic filtering, before excluding IUCN ranges outside the US EEZ.
- v6 (9,424) — after excluding ~371 species whose IUCN ranges fall entirely outside the US EEZ. This count also silently required overlap with a BOEM Program Area, conflating “valid” with “occurs in a Program Area.”
- v7 (16,153) — the Program-Area requirement was removed, so the count reflects the full study-area biodiversity (the ~17,000-species universe minus the taxonomic and range exclusions). The 9,230 within the Program Areas is now reported separately as a spatial subset rather than baked into the flag.
From source taxa to valid species
The table below traces the cumulative effect of each is_ok gate, applied in the order used by the merge pipeline. It shows the number of species removed at each step and the number remaining, from the full pool of source-dataset taxa down to the valid species, and then the spatial subset that falls within the BOEM Program Areas. The largest reductions come from non-marine taxa and from species with no merged distribution, which includes those whose IUCN expert range falls entirely outside the US EEZ and therefore produce zero cells. Counts are pulled live for the current database version (v7).
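The removed and remaining columns of such a table can be computed by applying the gates one after another to a shrinking pool. The Python sketch below shows the idea on made-up records; the gate labels, field names, and data are hypothetical and do not reproduce the pipeline's actual gates or counts.

```python
def attrition(taxa, gates):
    """Apply ordered (label, predicate) gates; return (label, removed, remaining) rows."""
    remaining = list(taxa)
    rows = [("source taxa", 0, len(remaining))]
    for label, keep in gates:
        kept = [t for t in remaining if keep(t)]
        # Each gate only sees taxa that survived the gates before it.
        rows.append((label, len(remaining) - len(kept), len(kept)))
        remaining = kept
    return rows


# Made-up records and gates, for illustration only.
example_taxa = [
    {"marine": True, "extinct": False, "n_cells": 5},
    {"marine": False, "extinct": False, "n_cells": 3},
    {"marine": True, "extinct": True, "n_cells": 2},
    {"marine": True, "extinct": False, "n_cells": 0},
]
example_gates = [
    ("marine", lambda t: t["marine"]),
    ("not extinct", lambda t: not t["extinct"]),
    ("has mapped cells", lambda t: t["n_cells"] > 0),
]
example_table = attrition(example_taxa, example_gates)
```

Because the gates are cumulative, the count removed at a given step depends on gate order: a taxon that fails several criteria is attributed to the first gate it fails.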
Key Function
The taxonomic matching is implemented in msens::match_taxa(), which orchestrates the ID resolution cascade and returns a unified taxon table with cross-referenced identifiers from all authorities.