NZOR Data Integration

Workflow

...

UML Diagram

[Image: UML diagram]

NZOR Integration Fields

There are two types of matching: simple and structured. Simple matching applies when only the name text (and possibly a few other fields) is provided; structured matching applies when a set of detailed fields is provided.
For simple matching, the fields marked as required for matching must be derived (parsed or calculated) from the fields that are provided.
Simple:

...

Field	Must be provided in provider dataset	Required for matching algorithm (must be provided or derived)
ProviderRecordId	true	true
FullName	true	false
TaxonRank	false	true
GoverningCode	true	true
Canonical	true	true
ParentId	true	true
Authors	false	false
PreferredNameId	false	false
YearOfPublication	false	false
MicroReference	false	false
PublishedIn	false	false
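
The table above can be read as two checks on each provider record. The following is a minimal Python sketch, assuming records arrive as dictionaries keyed by the field names in the table; the helper names (`missing_provided_fields`, `fields_to_derive`) are hypothetical and not part of NZOR.

```python
# Field rules copied from the table above:
# field name -> (must be provided by the provider, required for matching)
FIELD_RULES = {
    "ProviderRecordId": (True, True),
    "FullName": (True, False),
    "TaxonRank": (False, True),
    "GoverningCode": (True, True),
    "Canonical": (True, True),
    "ParentId": (True, True),
    "Authors": (False, False),
    "PreferredNameId": (False, False),
    "YearOfPublication": (False, False),
    "MicroReference": (False, False),
    "PublishedIn": (False, False),
}


def _is_blank(record, field_name):
    """True when the field is absent, None, or only whitespace."""
    value = record.get(field_name)
    return value is None or str(value).strip() == ""


def missing_provided_fields(record):
    """Fields the provider must supply but has left blank."""
    return [name for name, (must_provide, _) in FIELD_RULES.items()
            if must_provide and _is_blank(record, name)]


def fields_to_derive(record):
    """Fields needed for matching that are still blank and must be derived."""
    return [name for name, (_, for_matching) in FIELD_RULES.items()
            if for_matching and _is_blank(record, name)]
```

A record that passes the first check but still lacks TaxonRank would then be handed to whatever parsing step derives the rank before matching starts.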

Attachment Points

The highest ranked taxon name in a provider DataSet must either attach to a Kingdom name or a defined attachment point for that provider dataset.
For example, a provider dataset may hold the names for a particular family, say Compositae. An attachment point needs to be defined for this name (Compositae). This attachment point name MUST then be provided in the dataset so that it is possible to determine how to attach all subordinate names.
It is possible to have multiple attachment points, for different parts of the taxonomic classification hierarchy. There will need to be a Default attachment point defined for default actions such as placing a name that has unknown parentage.
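
The attachment rules above could be sketched as follows. This is an illustration only: the `ProviderDataset` class, the `attach_top_name` function and the idea of keying attachment points by name text are assumptions, not the NZOR implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ProviderDataset:
    name: str
    # Id of the NZOR name used when no other attachment can be determined.
    default_attachment: str
    # Attachment point name text -> id of the NZOR name (hypothetical ids).
    attachment_points: dict = field(default_factory=dict)


def attach_top_name(dataset, top_name, kingdoms):
    """Return (nzor_name_id, used_default) for the highest ranked name.

    `kingdoms` maps Kingdom name text to NZOR name ids.
    """
    if top_name in dataset.attachment_points:
        return dataset.attachment_points[top_name], False
    if top_name in kingdoms:
        return kingdoms[top_name], False
    # Unknown parentage: fall back to the default attachment point
    # and tell the caller so that the name can be flagged.
    return dataset.default_attachment, True
```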

Matching

[Image: matching diagram]

Matching Components

...

Parent

The parent of a name defines where the name fits within a scientific classification, e.g. the genus in which a particular species is placed (in "Poa anceps", the genus Poa is the parent name of the species "anceps"). This is a valuable property for the matching process.
Some types of names do not have a classification, e.g. Vernacular (common) names.

Name Type	Example	Has classification system
Scientific	Poa anceps	Yes
Vernacular (common)	Pohutukawa	No
Trade name	Kiwifruit (??)	No (??)
Tag name	"Gingidia aff. baxterae (White Rock Station; AK 299889)"	No (??)

Multiple parent concepts
It is possible for a provider to maintain multiple parent concepts for a particular name, and therefore have that name placed under several different parents, depending on which authority they are following at the time (i.e. multiple classifications). At this stage we are restricting data providers to supplying a single "In Use" parent concept for a name (i.e. they can provide multiple parent concepts, but only one of those can be flagged as In Use).
If there are multiple In Use parent concepts, or none, then further calculation is required. In this case all names that fall under any of the parents defined by the parent concepts are included in the matching. This passes a larger set of names through to the next stage of matching. If the matching algorithm concludes with more than one match because of the multiple parents, the match/integration fails. If it concludes with no matches then, because of the multiple parent concepts, it is not possible to determine where to place the name; the name is instead connected to the default attachment point for the dataset and should be flagged in some way to indicate this scenario.
How we handle this multiple classification situation into the future needs to be clarified.
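
The In Use rule and its fallback can be sketched in Python as below. The function names, the tuple representation of a parent concept and the outcome labels are all hypothetical; only the branching follows the description above.

```python
def candidate_parents(parent_concepts):
    """Pick the parents whose subordinate names are included in matching.

    `parent_concepts` is a list of (parent_id, in_use) tuples.
    Returns (parent_ids, ambiguous).
    """
    in_use = [pid for pid, flagged in parent_concepts if flagged]
    if len(in_use) == 1:
        return in_use, False
    # None or several In Use concepts: widen to every supplied parent
    # (duplicates removed, original order kept).
    all_parents = list(dict.fromkeys(pid for pid, _ in parent_concepts))
    return all_parents, True


def ambiguous_outcome(match_ids, default_attachment):
    """Outcome of matching when the parent set was ambiguous."""
    if len(match_ids) == 1:
        return ("matched", match_ids[0])
    if len(match_ids) > 1:
        return ("failed", None)
    # No match and no single parent: attach to the default point and flag.
    return ("attached_to_default_flagged", default_attachment)
```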

Higher Level Names
It is possible for names at a higher level in the classification hierarchy (above genus) to be defined in different ways – different classifications. This should not prevent names at the same rank from matching even if their parent is not the same. For example a Family "Famae" is defined under the Order "Orderae" in one dataset and under the Order "Otherae" in another dataset. These names (Famae) are the same – i.e. they should match during integration. This means that the parent concepts defined by a particular dataset may often have to be overlooked during matching and integration.

Names with no parents
Trade names, tag names and vernacular names do not include a classification and therefore the integration must not include matching based on parent linkages.
Even scientific names with classification have no parent at the top of the hierarchy where this set of names needs to be attached to the NZOR dataset. This is determined by the attachment points defined in the admin for that particular provider.
There is a possible issue where the name as given by the submitter is at one rank and has a defined parent one rank above, but this is not reflected in the current NZOR data because of intermediate ranks. For example, the following name and parent name may be submitted for matching:
Name : Brickellia Elliott
Rank : genus
Parent : Compositae
Rank : family
And the NZOR data could be:
Name : Brickellia Elliott
Rank : genus
Parent : Eupatorieae Cass.
Rank : tribe
GrandParent : Compositae
Rank : family
Filtering to all names with the parent "Compositae" will therefore not find the NZOR genus "Brickellia". This issue is solved by filtering to all names with the Ancestor "Compositae" AND the Rank "genus".
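
The difference between the two filters can be shown with the Brickellia data from the example. This is a sketch under assumed data structures: `names` maps a hypothetical id to (name text, rank) and `parent_of` maps a name id to its parent id.

```python
def ancestor_ids(name_id, parent_of):
    """All ancestors of a name, nearest first."""
    result = []
    current = parent_of.get(name_id)
    while current is not None and current not in result:
        result.append(current)
        current = parent_of.get(current)
    return result


def filter_by_parent(parent_id, rank, names, parent_of):
    """Names at `rank` whose direct parent is `parent_id`."""
    return [nid for nid, (_, r) in names.items()
            if r == rank and parent_of.get(nid) == parent_id]


def filter_by_ancestor(ancestor_id, rank, names, parent_of):
    """Names at `rank` that have `ancestor_id` anywhere above them."""
    return [nid for nid, (_, r) in names.items()
            if r == rank and ancestor_id in ancestor_ids(nid, parent_of)]


# NZOR data from the example above (ids are made up).
names = {
    "fam": ("Compositae", "family"),
    "tribe": ("Eupatorieae Cass.", "tribe"),
    "gen": ("Brickellia Elliott", "genus"),
}
parent_of = {"tribe": "fam", "gen": "tribe"}
```

With this data, `filter_by_parent("fam", "genus", names, parent_of)` finds nothing because the tribe sits between the family and the genus, while `filter_by_ancestor` with the same arguments finds the genus.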

Rank

The taxon rank is only applicable to scientific names. Vernacular names and trade names have a rank of "none". If the name that is being matched is a scientific name then the rank is used to filter the names down to those that are at the same rank as the name to be matched.
The rank provided by the submitter must match one of the standard ranks in the NZOR system, or one of the "known abbreviations" of that rank.
For infraspecific names (i.e. those below the rank of species), the use of the rank is not covered by the codes of nomenclature and hence the ranks are interchangeable. For example, the infraspecific name Aus bus var. cus is equivalent to the name Aus bus subsp. cus. This must be taken into account during matching and integration.
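
A minimal sketch of rank handling follows. The rank list and the abbreviation table are illustrative assumptions, not the actual NZOR lists; the point is the normalisation step and the rule that infraspecific ranks are treated as interchangeable.

```python
# Illustrative only: the real standard ranks and abbreviations live in NZOR.
STANDARD_RANKS = {"kingdom", "class", "order", "family", "tribe", "genus",
                  "species", "subspecies", "variety", "form", "none"}
KNOWN_ABBREVIATIONS = {"gen.": "genus", "sp.": "species",
                       "subsp.": "subspecies", "ssp.": "subspecies",
                       "var.": "variety", "f.": "form"}
INFRASPECIFIC_RANKS = {"subspecies", "variety", "form"}


def normalise_rank(text):
    """Map a submitted rank or abbreviation to a standard rank."""
    key = text.strip().lower()
    if key in STANDARD_RANKS:
        return key
    if key in KNOWN_ABBREVIATIONS:
        return KNOWN_ABBREVIATIONS[key]
    raise ValueError(f"unknown rank: {text!r}")


def ranks_compatible(rank_a, rank_b):
    """Ranks match when equal, or when both are below species."""
    a, b = normalise_rank(rank_a), normalise_rank(rank_b)
    return a == b or (a in INFRASPECIFIC_RANKS and b in INFRASPECIFIC_RANKS)
```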

Canonical

The canonical is the basic name text without any authors, year, publication details or parent information.
For example:
Name	Canonical
Brickellia Elliott	Brickellia
Brickellia linifolia D.C.Eaton	linifolia
Craspedia "Henderson"	"Henderson"

Authors

Authors come in many forms and can be difficult to match against. One solution to this is to build a dictionary of standardised authors and their equivalents. Then when a submitted name is processed, the author is matched against the standardised list and the "standard" author name is then used to filter NZOR names for the match.
Data table structure for Author maintenance:

Field	Type
AuthorId	Guid
CorrectAuthorId	Guid
Abbreviation	String
FirstName	String
LastName	String
TaxonGroups	String
Dates	String
AlternativeNames	String
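
The author dictionary described above might be used as in the sketch below. Several details are assumptions: that CorrectAuthorId is empty for an author who is already the standard form, that AlternativeNames is a semicolon-separated list, and that spellings are compared after a simple normalisation of case and spacing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Author:
    author_id: str
    correct_author_id: Optional[str]  # None when this record is the standard
    abbreviation: str
    alternative_names: str = ""       # assumed to be semicolon separated


def normalise(text):
    """Lower-case and make spacing after full stops uniform."""
    return " ".join(text.lower().replace(".", ". ").split())


def build_lookup(authors):
    """Map every known spelling to the standard author abbreviation."""
    by_id = {a.author_id: a for a in authors}
    lookup = {}
    for author in authors:
        standard = by_id.get(author.correct_author_id, author)
        spellings = [author.abbreviation] + author.alternative_names.split(";")
        for spelling in spellings:
            key = normalise(spelling)
            if key:
                lookup[key] = standard.abbreviation
    return lookup


def standardise_author(text, lookup):
    """Return the standard abbreviation, or None if the author is unknown."""
    return lookup.get(normalise(text))
```

The standardised value, rather than the submitted text, would then be used to filter NZOR names.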

...

Field	Type	Description
NameId	Guid
BasionymAuthors	String	List of IDs of the Basionym Authors (non-standardised AuthorId value, space separated per author)
CombinationAuthors	String	List of IDs of the Combination Authors (non-standardised AuthorId value, space separated per author)
BasionymExAuthors	String	List of IDs of the Basionym "ex" Authors (non-standardised AuthorId value, space separated per author)
CombinationExAuthors	String	List of IDs of the Combination "ex" Authors (non-standardised AuthorId value, space separated per author)

Year

Todo …

Nomenclatural Status

Some names have different nomenclatural statuses (e.g. whether the name was validly published under the code; if not, the status is nom. inval.). It is possible for such a name to be subsequently validly published. The details of the name will be the same, but the name now has a different nomenclatural status. At first it seems best to treat these as separate names that are linked. However, this may cause issues during integration: for example, if a provider does not provide the nomenclatural status, is the data for the validly published name or the invalid one? It was therefore decided to treat these names as the same and pool all nomenclatural status values for that name; it is then up to the consumer / viewer of the name to determine its status. These name instances therefore result in the same name, but there were two nomenclatural acts that led to it.
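
Pooling the status values could be as simple as the sketch below. The function name and the list-of-strings input are assumptions; the status values in the usage note are examples only.

```python
def pool_statuses(provider_statuses):
    """Collect the distinct, non-blank status values for one name.

    Providers that supply no status (None or blank) contribute nothing,
    so their records still merge into the same consensus name.
    """
    pooled = []
    for status in provider_statuses:
        cleaned = (status or "").strip()
        if cleaned and cleaned not in pooled:
            pooled.append(cleaned)
    return pooled
```

For instance, three provider records with the statuses "nom. inval.", blank and "nom. cons." would give one consensus name carrying both non-blank values.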

Integration By SQL

A simpler and much faster approach to building an initial set of integrated names is to use SQL queries.

The idea is based on the fact that most names are distinct (about 98%). It is therefore much more efficient to generate a "backbone" of names from these distinct names, rather than iterating through them all, performing a match to discover there are no matches, then inserting the name as a new consensus name.

This approach works with the most complete names first, with the theory that a name with less detail will match multiple names with more detail.

The fields that are used for defining a distinct name are:

Canonical
Rank
Authors
Year
Genus
Species
GoverningCode

Genus and Species are not fields of a name, but are calculated fields based on parent concepts. The reason for including these fields is to ensure that any subgeneric or subspecific name matches other subgeneric/subspecific names that do not have exactly the same parent hierarchy.

For example:

Name 1: Aus bus var. cus

Aus, genus
- bus, species
  - cus, variety

Name 2: Aus bus xus var. cus

Aus, genus
- bus, species
  - xus, subspecies
    - cus, variety

The fields for these 2 names will be:

Name	Canonical	Rank	Authors	Year	Genus	Species	Governing Code
1	cus	var.			Aus	bus	ICBN
2	cus	var.			Aus	bus	ICBN

So according to these fields, the names will match even though the direct parents of the 2 'cus' names are different, which is correct.
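
The calculated Genus and Species fields can be illustrated with the two hierarchies above. This is a Python sketch rather than the SQL used for the integration; the dictionary layout, the ids and the `distinct_key` function are assumptions.

```python
def distinct_key(name_id, names, parent_of):
    """Build the tuple of fields that defines a distinct name.

    Genus and Species are taken from whichever ancestors hold those
    ranks, so intermediate ranks in the hierarchy do not affect the key.
    """
    name = names[name_id]
    genus = species = ""
    seen = set()
    current = parent_of.get(name_id)
    while current is not None and current not in seen:
        seen.add(current)
        ancestor = names[current]
        if ancestor["rank"] == "genus":
            genus = ancestor["canonical"]
        elif ancestor["rank"] == "species":
            species = ancestor["canonical"]
        current = parent_of.get(current)
    return (name["canonical"], name["rank"], name.get("authors", ""),
            name.get("year", ""), genus, species, name["code"])


# The two provider hierarchies from the example (ids are made up).
names = {
    "1-aus": {"canonical": "Aus", "rank": "genus", "code": "ICBN"},
    "1-bus": {"canonical": "bus", "rank": "species", "code": "ICBN"},
    "1-cus": {"canonical": "cus", "rank": "var.", "code": "ICBN"},
    "2-aus": {"canonical": "Aus", "rank": "genus", "code": "ICBN"},
    "2-bus": {"canonical": "bus", "rank": "species", "code": "ICBN"},
    "2-xus": {"canonical": "xus", "rank": "subspecies", "code": "ICBN"},
    "2-cus": {"canonical": "cus", "rank": "var.", "code": "ICBN"},
}
parent_of = {"1-bus": "1-aus", "1-cus": "1-bus",
             "2-bus": "2-aus", "2-xus": "2-bus", "2-cus": "2-xus"}
```

Both "1-cus" and "2-cus" produce the same key, even though the second has the subspecies xus as its direct parent.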

Another example:

Name 1: Lecanorales Nannf., order

Ascomycetes, class
- Lecanorales Nannf., order

Name 2: Lecanorales, order

Ascomycetes, class
- Lecanoromycetidae, subclass
  - Lecanorales, order

Name	Canonical	Rank	Authors	Year	Genus	Species	Governing Code
1	Lecanorales	order	Nannf.				ICBN
2	Lecanorales	order					ICBN

Again, the names will match even though the parent names are defined to be different. Again this is correct.
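
In the Lecanorales example the Authors field is blank for one name and filled for the other, so an exact comparison of the fields would not pair them. One way to read "most complete names first" is sketched below in Python (not the SQL actually used): keys are processed from most to least complete, and a blank field in a later key is treated as compatible with any value. The wildcard rule, the function names and the handling of ambiguous keys are assumptions.

```python
def completeness(key):
    """Number of non-blank fields in a distinct-name key."""
    return sum(1 for value in key if value)


def compatible(sparse, full):
    """True when every non-blank field of `sparse` equals that of `full`."""
    return all(s == "" or s == f for s, f in zip(sparse, full))


def build_backbone(keys):
    """Return (backbone, ambiguous) from a list of distinct-name keys."""
    backbone = []   # keys that become consensus names
    ambiguous = []  # keys that fit more than one backbone entry
    unique = list(dict.fromkeys(keys))  # de-duplicate, keep input order
    for key in sorted(unique, key=completeness, reverse=True):
        hits = [entry for entry in backbone if compatible(key, entry)]
        if not hits:
            backbone.append(key)
        elif len(hits) > 1:
            ambiguous.append(key)
        # Exactly one hit: the key folds into that existing entry.
    return backbone, ambiguous
```

Keys left in `ambiguous` would need the full matching process rather than the quick SQL-style pass.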

Generating Consensus Records

...
