NZOR Data Harvest and Import

Harvesting

The harvesting module gathers data from external data sources, such as NZOR providers, and synchronises that data with the provider data held within NZOR.

Overview of Harvesting Module


 
Main Functions:


Importing

Data for import must be represented using the NZOR Provider XML schema.
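As a rough illustration only, a provider record might be serialised along the following lines. This is a sketch, not the actual schema: the element and attribute names (NZORProviderData, Names, TaxonName, FullName, References, Reference, Citation) are assumptions and must be checked against the real NZOR Provider XML schema.

    # Sketch: build a minimal provider document in the spirit of the NZOR
    # Provider XML schema. Element names are assumptions only.
    import xml.etree.ElementTree as ET

    root = ET.Element("NZORProviderData")
    names = ET.SubElement(root, "Names")
    name = ET.SubElement(names, "TaxonName", attrib={"id": "N1"})
    ET.SubElement(name, "FullName").text = "Aus bus"
    refs = ET.SubElement(root, "References")
    ref = ET.SubElement(refs, "Reference", attrib={"id": "A1"})
    ET.SubElement(ref, "Citation").text = "Flora of New Zealand ..."

    print(ET.tostring(root, encoding="unicode"))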
Rules:

Business Logic Diagrams:
Import Name:


Deprecated Name:

Concepts

Concepts can be provided by a provider in two ways – either as full TaxonConcepts or as NameBasedConcepts.


These two types of concepts roughly map to taxonomic concepts and nomenclatural concepts, where nomenclatural concepts are the name relationships as defined by the circumscription of the name, following nomenclatural codes, and taxonomic concepts are subjective relationships. For this reason you can have two "types" of Parent Name – one nomenclaturally defined, the other a subjective placement of the name in the taxonomic hierarchy.
The full taxonomic concept model is as follows:
[Diagram: Taxon Name -> Taxon Concept -> Concept Relationship (with Relationship Type) -> Taxon Concept -> Taxon Name]

Example:
Name, ID = N1, Aus bus
Concept, ID = C1, for N1 according to A1
Concept Relationship, ID = CR1, C1 has parent C2
Concept, ID = C2, for N2 according to A1
Name, ID = N2, Aus
Reference, ID = A1, "Flora of New Zealand …"
Here all parts are required, and the provider must maintain the IDs.
The NameBasedConcept (NBC) has the following structure:

[Diagram: Name Based Concept -> Taxon Name, Taxon Name (Accepted), Taxon Name (Parent)]

Name, ID = N1, Aus bus
NBC, N1 has parent N2 according to A2
Name, ID = N2, Aus
Reference, ID = A2, "Flora of New Zealand …"
Here only the Name IDs (and the Reference ID) are maintained.
The problem comes when importing these name-based concepts into the NZOR data, which has the structure of the first example. We must either expand the NameBasedConcepts into the N-C-CR-C-N structure, generating temporary IDs, or create additional tables to handle NameBasedConcepts. The former option has been adopted.
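A minimal sketch of that expansion, assuming simple in-memory records (the field names and the temporary-ID scheme are illustrative assumptions, not the actual NZOR implementation):

    # Sketch: expand a NameBasedConcept ("N1 has parent N2 according to A2")
    # into the Name - Concept - ConceptRelationship - Concept - Name structure,
    # minting temporary concept/relationship IDs along the way.
    # Field names and the TMP- ID scheme are illustrative assumptions only.
    from itertools import count

    _tmp = count(1)

    def expand_nbc(name_id, parent_name_id, reference_id):
        child_concept = {"id": f"TMP-C{next(_tmp)}", "name": name_id, "accordingTo": reference_id}
        parent_concept = {"id": f"TMP-C{next(_tmp)}", "name": parent_name_id, "accordingTo": reference_id}
        relationship = {
            "id": f"TMP-CR{next(_tmp)}",
            "from": child_concept["id"],
            "to": parent_concept["id"],
            "type": "has parent",
        }
        return child_concept, parent_concept, relationship

    # NBC: N1 has parent N2 according to A2
    print(expand_nbc("N1", "N2", "A2"))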
On Import of Name Based Concepts:

Global cache

Use Case:

MAF submits a list of names in BRAD. 60% are matched by NZOR records; 40% are either garbage or good names not in NZOR. NZOR queries global sources (GNI, CoL, CoL Regional Hub, GNUB (IF/ZooBank)) for hits. Returns are stored in the global cache and a GUID is minted/propagated. New matching records are returned to the consumer. Implied preference of providers (stop when a return is obtained): CoL > GNUB > GNI. What to do with multiple hits? The assumed consumer workflow is: 1) submit the list of non-matches, 2) examine the returns to identify those required, 3) resubmit with the 'cache' flag switched on.
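Assuming a batch-match service and a 'cache' flag along the lines above, the consumer workflow might look like the sketch below. The endpoint URL, parameter names and response fields are hypothetical.

    # Sketch of the assumed consumer workflow: match against NZOR, inspect the
    # non-matches, drop the garbage, then resubmit with the cache flag on.
    # Endpoint URL, parameters and response fields are hypothetical.
    import requests

    MATCH_URL = "https://example.org/nzor/api/batch-match"  # hypothetical

    names = ["Aus bus", "Cus dus", "not a name"]
    first = requests.post(MATCH_URL, json={"names": names, "cache": False}).json()

    # Consumer examines the non-matches and removes garbage entries.
    unmatched = [r["name"] for r in first["results"] if not r["matches"]]
    wanted = [n for n in unmatched if n != "not a name"]

    # Resubmit with the cache flag on so returns are brokered into the global cache.
    second = requests.post(MATCH_URL, json={"names": wanted, "cache": True}).json()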

Q. Two processes? First match against NZOR, which yields 0 hits, 1 hit, or multiple hits per name. A consumer workflow is required for the 0-hit and multiple-hit cases. The 0-hit names (excluding garbage removed by the consumer workflow) are resubmitted as a query to the global service interface.

The service response provides a link to the result set. The consumer polls that end point and gets either 'not ready'/'don't know' or the CSV/structured data.
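A minimal polling sketch under those assumptions (the result URL, the readiness signal and the response format are hypothetical):

    # Sketch: poll the result end point returned by the service until the
    # result set is ready, then fetch it. URL and readiness signal are hypothetical.
    import time
    import requests

    def poll_result(result_url, interval=10, attempts=30):
        for _ in range(attempts):
            resp = requests.get(result_url)
            if resp.status_code == 200 and resp.headers.get("Content-Type", "").startswith("text/csv"):
                return resp.text          # CSV / structured data is ready
            time.sleep(interval)          # 'not ready' / 'don't know': wait and retry
        raise TimeoutError("result set not ready")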

Distribution: for both existing NZ names and global cache names, query global sources for distribution data (TDWG Level 4). Use GBIF for records in the network (keep just the list of countries) and CoL data (about 10% of CoL has distribution data). Return and integrate into the NZOR record.
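A hedged sketch of the GBIF part of that step: the endpoint paths, parameter names and response fields shown are my assumptions and should be verified against the current GBIF API documentation.

    # Sketch: look up a name on GBIF and keep just the list of countries in
    # which occurrence records exist. Endpoints, parameters and response
    # fields are assumptions; check the GBIF API docs before relying on them.
    import requests

    def gbif_countries(name):
        match = requests.get("https://api.gbif.org/v1/species/match", params={"name": name}).json()
        key = match.get("usageKey")
        if key is None:
            return []
        occ = requests.get(
            "https://api.gbif.org/v1/occurrence/search",
            params={"taxonKey": key, "facet": "country", "limit": 0},
        ).json()
        facets = occ.get("facets", [])
        counts = facets[0]["counts"] if facets else []
        return [c["name"] for c in counts]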

In the web interface batch match there is a choice: NZOR or the Global Resolution Service? In the latter case the results are not kept in the global cache; the consumer needs to specifically request that returns be brokered within NZOR and cached.

Need to consider whether login/security is needed for this (and also so that a consumer token can be used to track requests/usage).

In both cases treat the Global Cache as a set of provider records (CoL as a provider, GNI as a provider, etc.). Cached records are to be linked into the CoL management hierarchy if possible (if the source provides a higher classification); if not, the provider record 'floats'.

initial query

web service + validation? + return data structure + flags to cache or not

global harvest from multiple sources

maintain a list of end points and the necessary query formats, and handle the returns.
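For example, such a registry might look like the sketch below; the source URLs and query-template fields are illustrative placeholders, not the actual configuration.

    # Sketch: a simple registry of global source end points and query formats.
    # URLs and template fields are illustrative placeholders only.
    GLOBAL_SOURCES = {
        "CoL":  {"endpoint": "https://example.org/col/search",  "query": {"name": "{name}"}},
        "GNUB": {"endpoint": "https://example.org/gnub/search", "query": {"q": "{name}"}},
        "GNI":  {"endpoint": "https://example.org/gni/search",  "query": {"search_term": "{name}"}},
    }

    def build_queries(name):
        # Expand each source's query format for one name, in preference order.
        for source, cfg in GLOBAL_SOURCES.items():
            params = {k: v.format(name=name) for k, v in cfg["query"].items()}
            yield source, cfg["endpoint"], params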

integration

put return data in global cache (in name cache), link to hierarchy (maybe), trigger harvest into integrator, generate GUIDs
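A minimal sketch of that step, assuming a simple in-memory cache; the record fields and the hierarchy-linking detail are assumptions, not the actual integrator behaviour.

    # Sketch: put a returned record into the global (name) cache, mint a GUID,
    # and note the source so it can be treated as a provider record.
    # Record fields are illustrative assumptions only.
    import uuid

    global_cache = {}

    def cache_record(source, record):
        guid = str(uuid.uuid4())               # mint a GUID for the cached record
        global_cache[guid] = {
            "guid": guid,
            "source": source,                  # e.g. "CoL", "GNI"
            "record": record,
            "parent": record.get("parent"),    # link into hierarchy if the source supplies one
        }
        return guid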

maintenance

scheduled re-query and replace/update?

geodistribution