4  Taxonomy

Integrating species distribution data from 7 source datasets requires resolving taxonomic identities across multiple naming systems. The MST uses a multi-authority matching pipeline to ensure each species is uniquely identified, enabling accurate merging of models from different data providers.

4.1 Taxonomic Authorities

The following authorities are loaded into a reference database (spp.duckdb) for taxonomic reconciliation:

Table 4.1: Taxonomic authorities used in the MST for species name resolution.
| Authority | Description | Key Identifier |
| --- | --- | --- |
| WoRMS | World Register of Marine Species; authoritative list for marine taxa | worms_id (AphiaID) |
| GBIF | Global Biodiversity Information Facility Backbone Taxonomy (~6.3M taxa) | gbif_id |
| ITIS | Integrated Taxonomic Information System; US federal standard | itis_tsn |
| IUCN Red List | International Union for Conservation of Nature; conservation assessments | iucn_id |
| BirdLife BOTW | Birds of the World; authoritative for seabird taxonomy | botw_id |

4.2 ID Resolution Cascade

Species identifiers are resolved through a cascading lookup process:

  1. ITIS TSN match: if the source dataset provides an ITIS Taxonomic Serial Number, use the ITIS-to-WoRMS crosswalk for direct matching
  2. WoRMS crosswalk: look up accepted WoRMS AphiaID via the GBIF backbone, which integrates WoRMS as the marine taxonomy source
  3. Scientific name match: for records without matching identifiers, attempt exact scientific name matching against WoRMS accepted names
  4. API lookup: for remaining unresolved names, query the WoRMS REST API for fuzzy matching (via wm_records_name() from the worrms R package)

At each stage, deprecated names (synonyms) are resolved to their accepted names, and the taxonomicStatus field is used to determine the preferred name.
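The cascade amounts to trying progressively weaker matches and stopping at the first hit. The following is a minimal Python sketch of that control flow, not the MST implementation (which lives in the R function msens::match_taxa()). The lookup tables, the record fields, the stage labels, the api_lookup callback, and every ID are hypothetical, and keying the second stage on gbif_id is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical stand-ins for the reference tables; all IDs are made up.
ITIS_TO_WORMS = {1001: 9001}            # itis_tsn -> worms_id
GBIF_TO_WORMS = {2002: 9002}            # gbif_id -> accepted worms_id (assumed key)
WORMS_ACCEPTED = {"Genus alpha": 9003}  # accepted scientific name -> worms_id


@dataclass
class SourceRecord:
    scientific_name: str
    itis_tsn: Optional[int] = None
    gbif_id: Optional[int] = None


def resolve_worms_id(
    rec: SourceRecord,
    api_lookup: Optional[Callable[[str], Optional[int]]] = None,
) -> Tuple[Optional[int], str]:
    """Return (worms_id, stage) from the first stage that matches."""
    stages = [
        ("itis_tsn", lambda r: ITIS_TO_WORMS.get(r.itis_tsn)),
        ("gbif_crosswalk", lambda r: GBIF_TO_WORMS.get(r.gbif_id)),
        ("exact_name", lambda r: WORMS_ACCEPTED.get(r.scientific_name.strip())),
        # The fuzzy stage is injected so the sketch needs no network access.
        ("api_fuzzy", lambda r: api_lookup(r.scientific_name) if api_lookup else None),
    ]
    for stage, lookup in stages:
        worms_id = lookup(rec)
        if worms_id is not None:
            return worms_id, stage
    return None, "unresolved"
```

Returning the stage label alongside the identifier makes it possible to audit how many names were resolved by each tier.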

4.3 Species Categories

Valid species are classified into 7 categories based on their taxonomic position:

Table 4.2: Species categories used in the MST for scoring and visualization.
| Category | Description | Examples |
| --- | --- | --- |
| bird | Seabirds and shorebirds | albatross, petrel, tern, pelican |
| coral | Reef-building and deep-sea corals | stony coral, soft coral, black coral |
| fish | Bony and cartilaginous fishes | grouper, shark, ray, tuna |
| invertebrate | Non-coral marine invertebrates | crab, lobster, sea urchin, squid |
| mammal | Marine mammals | whale, dolphin, seal, manatee |
| reptile | Sea turtles | loggerhead, green, leatherback |
| other | Uncategorized marine organisms | worms, tunicates, and bryozoans |

4.4 Data Quality

4.4.1 Duplicate Resolution

Multiple source datasets may provide models for the same species. The taxonomic matching step identifies duplicates by resolving all source names to canonical WoRMS or BirdLife identifiers. When a species appears in multiple datasets, the model merging pipeline (see Chapter 6) takes the MAX value across all sources rather than treating them as separate species.
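As an illustration of the MAX rule, the sketch below merges rows from several sources that share a canonical taxon identifier. It assumes the maximum is taken per grid cell and that values from different sources are on a comparable scale; neither assumption is confirmed here, and the row layout and function name are hypothetical (see Chapter 6 for the actual merging pipeline).

```python
from typing import Dict, Iterable, Tuple

# Hypothetical row layout: (canonical taxon id, grid cell id, model value).
Row = Tuple[str, int, float]


def merge_max(rows: Iterable[Row]) -> Dict[Tuple[str, int], float]:
    """Keep the largest value seen for each (taxon, cell) pair."""
    merged: Dict[Tuple[str, int], float] = {}
    for taxon_id, cell_id, value in rows:
        key = (taxon_id, cell_id)
        if key not in merged or value > merged[key]:
            merged[key] = value
    return merged
```

Because rows are keyed on the canonical identifier rather than the source name, two datasets that use different synonyms for the same species collapse into a single merged record.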

4.4.2 Synonym Handling

Taxonomic names change over time as species are reclassified. The pipeline handles this by:

  • resolving all names through acceptedNameUsageID in each authority
  • tracking the original source name alongside the accepted name
  • flagging deprecated WoRMS IDs and updating them to current accepted IDs
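The three steps above can be sketched as a pointer-following routine. This is a minimal illustration, assuming each authority is available as a mapping from a taxon ID to its acceptedNameUsageID; the function names, the output fields, and the hop limit are hypothetical.

```python
from typing import Dict


def resolve_accepted(taxon_id: int, accepted_of: Dict[int, int], max_hops: int = 10) -> int:
    """Follow acceptedNameUsageID pointers until an accepted record is reached."""
    seen = set()
    current = taxon_id
    # An accepted record either points to itself or has no pointer at all.
    while current in accepted_of and accepted_of[current] != current:
        if current in seen or len(seen) >= max_hops:
            raise ValueError(f"synonym chain for {taxon_id} loops or is too long")
        seen.add(current)
        current = accepted_of[current]
    return current


def track_name(source_name: str, source_id: int, accepted_of: Dict[int, int]) -> dict:
    """Keep the original source name and ID alongside the accepted ID."""
    accepted_id = resolve_accepted(source_id, accepted_of)
    return {
        "source_name": source_name,
        "source_id": source_id,
        "accepted_id": accepted_id,
        "was_deprecated": accepted_id != source_id,
    }
```

The loop guard matters because a chained or circular synonym entry would otherwise resolve only partially or never terminate.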

4.4.3 Valid Species Filter (is_ok)

The source datasets together contain 17,561 taxa, dominated by the ~17,000 marine species modeled globally by AquaMaps, plus those from the regulatory and range-map datasets. Not all are appropriate for the sensitivity analysis. A taxon is flagged valid (is_ok = TRUE) only when it meets every one of the following criteria:

  • Taxonomically accepted & marine: a resolved WoRMS/BirdLife identity with an accepted (or alternative-representation) status, classified as marine (WoRMS isMarine, or BirdLife seabird), and not extinct (WoRMS isExtinct or IUCN Red List EX); non-turtle reptiles (e.g., sea snakes) are excluded.
  • Has a mapped distribution: the merged model resolves to at least one grid cell with a value — a few stale, empty model records are excluded so the valid count matches the species actually mapped.
  • Expert range intersects the US EEZ: where an IUCN expert range map exists, it must overlap US waters. Species whose global range falls entirely outside the US EEZ are dropped even if AquaMaps predicts suitable habitat inside it, since such predictions are treated as edge-of-range model artifacts not supported by expert assessment.

Program-Area overlap is not a validity criterion. A species is valid if it has a modeled distribution anywhere in the study area; whether it falls within one of the BOEM Program Areas is determined later by a spatial (zone) query, not by the is_ok flag. This separation was introduced in v7. Earlier versions silently required Program-Area overlap, which conflated “valid” with “occurs in a Program Area” and hid biodiversity outside the current program cycle (see the version comparison below).
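Read as a predicate, the validity rule is a conjunction of the three criteria, with no term for Program-Area overlap. The sketch below is illustrative only: the field names, the status strings, and the use of None for "no IUCN range map" are assumptions, not the pipeline's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed status labels; the real values depend on the authority tables.
ACCEPTED_STATUSES = {"accepted", "alternative representation"}


@dataclass
class Taxon:
    has_resolved_id: bool                     # resolved WoRMS/BirdLife identity
    taxonomic_status: str
    is_marine: bool                           # WoRMS isMarine or BirdLife seabird
    is_extinct: bool                          # WoRMS isExtinct or IUCN Red List EX
    is_non_turtle_reptile: bool               # e.g., sea snakes
    n_cells: int                              # grid cells with a value in the merged model
    iucn_range_overlaps_eez: Optional[bool]   # None when no IUCN range map exists


def is_ok(t: Taxon) -> bool:
    taxonomy_ok = (
        t.has_resolved_id
        and t.taxonomic_status in ACCEPTED_STATUSES
        and t.is_marine
        and not t.is_extinct
        and not t.is_non_turtle_reptile
    )
    mapped = t.n_cells >= 1
    # The range test only applies where an expert range map exists.
    range_ok = t.iucn_range_overlaps_eez is not False
    return taxonomy_ok and mapped and range_ok
```

Keeping Program-Area membership out of the predicate means the same flag can serve any zone query over the study area.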

This filter currently yields 16,153 valid species across the full study area (the US EEZ), of which 9,230 have modeled distributions within the 20 BOEM Program Areas of the current program cycle. These figures are pulled live from the /stats.json endpoint for the current database version (v7).

The definition of “valid species” has changed across versions, first tightening and then broadening, so the headline count is not directly comparable between them:

  • v5 (~9,819): valid after taxonomic filtering, before excluding IUCN ranges outside the US EEZ.
  • v6 (9,424): after excluding ~371 species whose IUCN ranges fall entirely outside the US EEZ. This count also silently required overlap with a BOEM Program Area, conflating “valid” with “occurs in a Program Area.”
  • v7 (16,153): the Program-Area requirement was removed, so the count reflects the full study-area biodiversity (the 17,561 source taxa minus the taxonomic and range exclusions). The 9,230 within the Program Areas is now reported separately as a spatial subset rather than baked into the flag.

4.4.4 From source taxa to valid species

The table below traces the cumulative effect of each is_ok gate, applied in the order used by the merge pipeline. For each step it shows the number of species removed and the number remaining, from the full pool of source-dataset taxa down to the valid species, followed by the spatial subset that falls within the BOEM Program Areas. The largest reductions come from species with no merged distribution (species whose IUCN expert range falls entirely outside the US EEZ produce zero cells and so fall into this group) and from non-marine taxa. Counts are pulled live for the current database version (v7).

Table 4.3: Cumulative is_ok filter funnel: species removed and remaining at each validity gate, then the within-Program-Area spatial subset. Pulled live from the API for the current database version.
| Filter step | Removed | Species remaining |
| --- | --- | --- |
| Source taxa (all datasets) | | 17,561 |
| Resolved taxon ID | | 17,561 |
| Has a merged model (distribution) | −1,173 | 16,388 |
| Not extinct | −21 | 16,367 |
| Marine | −154 | 16,213 |
| Accepted taxonomy (excl. non-turtle reptiles) | −55 | 16,158 |
| Mapped to ≥1 cell: valid (is_ok) | −5 | 16,153 |
| Within BOEM Program Areas (spatial subset) | −6,923 | 9,230 |
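The funnel is a running subtraction, which makes it easy to check that the removed counts and the remaining counts agree. The sketch below recomputes the "Species remaining" column from the "Removed" column of Table 4.3. The blank removed value for the resolved-ID step is treated as 0, an assumption based on the unchanged count, and the function name is hypothetical.

```python
SOURCE_TAXA = 17_561

# (filter step, species removed), in the order listed in Table 4.3.
GATES = [
    ("Resolved taxon ID", 0),  # blank in the table; assumed to be 0
    ("Has a merged model (distribution)", 1_173),
    ("Not extinct", 21),
    ("Marine", 154),
    ("Accepted taxonomy (excl. non-turtle reptiles)", 55),
    ("Mapped to >=1 cell: valid (is_ok)", 5),
    ("Within BOEM Program Areas (spatial subset)", 6_923),
]


def funnel(start, gates):
    """Return (step, removed, remaining) after applying each gate in order."""
    remaining = start
    rows = []
    for step, removed in gates:
        remaining -= removed
        rows.append((step, removed, remaining))
    return rows


if __name__ == "__main__":
    for step, removed, remaining in funnel(SOURCE_TAXA, GATES):
        print(f"{step:<48} -{removed:>6,} {remaining:>8,}")
```

The second-to-last row reproduces the 16,153 valid species and the last row the 9,230 within the Program Areas.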

4.5 Key Function

The taxonomic matching is implemented in msens::match_taxa(), which orchestrates the ID resolution cascade and returns a unified taxon table with cross-referenced identifiers from all authorities.