We store representations of a wide array of publication types in the GeoKB. Our work on these entities focuses on two major aspects:
- Building confirmed linkages from publications to other entities in the knowledge graph: people as contributors, organizations as funders/owners, named geographic places, and other assets (other pubs, datasets, etc.). In source catalogs, these linkages are often fuzzy in that they consist of text strings rather than persistent, resolvable identifiers. In our processing work, we only bring in linkages we can confirm.
- Storing value-added extractions from publications produced by various natural language processing methods. Our current focus is on identifying more complete and focused geographic places and their significance to the content of the publication, organizational entities that are either collaborators on a study or stakeholders toward which the content of a publication is directed, and scientific methods and techniques described in the text that can be linked to citable references.
Publication Sources
Publication Schema Elements
Label, description, aliases
By and large, the title of a publication is used as its label in the knowledge graph representation. Descriptions are generally derived as an additional piece of visual information about the particular publication and its context. Titles and descriptions are truncated for label use when they are longer than 250 characters. Aliases may include additional information, such as the special codes the USGS assigns to its reports, which are sometimes used in practice as a finding aid.
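The 250-character limit can be applied with a small helper. The following is a minimal sketch and not the GeoKB's actual processing code: the function name `truncate_for_label`, the trailing ellipsis, and the choice to cut at a word boundary are all assumptions made for illustration.

```python
MAX_LABEL_LENGTH = 250  # limit stated in the text for labels and descriptions


def truncate_for_label(text: str, max_length: int = MAX_LABEL_LENGTH) -> str:
    """Collapse whitespace and shorten text to at most max_length characters."""
    cleaned = " ".join(text.split())
    if len(cleaned) <= max_length:
        return cleaned
    # Reserve three characters for an ellipsis (an assumed convention).
    cut = cleaned[: max_length - 3]
    # Prefer to cut at the last space so a word is not split in half.
    if " " in cut:
        cut = cut.rsplit(" ", 1)[0]
    return cut.rstrip() + "..."
```

Short titles pass through unchanged apart from whitespace cleanup, so the helper can be applied to every title and description without checking the length first.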
Classification
This item serves as the top-level class whose subclasses are used for "instance of" classification of publication items. It includes the full slate of USGS Numbered Series report types, as we are pulling those in comprehensively.
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?class ?classLabel
WHERE {
  # Follow zero or more P2 links up to Q6, so the results include
  # Q6 itself and every class beneath it.
  ?class wdt:P2* wd:Q6 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
Identifiers
- Digital Object Identifier (DOI) - Generally considered to be a persistent, unique, resolvable identifier specific to the publication item. There are exceptions where the DOI registration was not managed with this goal in mind, so more than one item in the GeoKB can be associated with the same DOI.
- USGS Publications Warehouse indexId - An internal ID for the Pubs Warehouse that keeps us in sync with that platform.
- Zotero Group Library/Item ID - Certain Zotero Group Libraries are registered through the GeoKB as GeoArchive sources, meaning we count on that system as our primary source for citation metadata and access to content.
- GeoDeepDive ID (GDDID) - Publications that have been processed through GeoDeepDive/xDD pipelines have a unique GUID from that digital library. The ID represents the NLP digestion, vectorization, and other processing completed for a given pub, and we use it to pull back derived claims.
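Because a DOI is not guaranteed to be unique across GeoKB items, it can help to find the DOIs that appear on more than one item before treating the DOI as a key. The sketch below is illustrative only: the function names, the list of URL prefixes, and the sample item IDs and DOIs are hypothetical. It relies on the fact that DOIs are case-insensitive.

```python
from collections import defaultdict

# Common ways a DOI is written in citation metadata (an assumed, partial list).
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Lowercase a DOI and strip a resolver URL or 'doi:' prefix if present."""
    doi = raw.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def find_shared_dois(items):
    """Return {doi: [item_id, ...]} for every DOI attached to two or more items.

    `items` is an iterable of (item_id, doi) pairs.
    """
    by_doi = defaultdict(list)
    for item_id, raw_doi in items:
        by_doi[normalize_doi(raw_doi)].append(item_id)
    return {doi: ids for doi, ids in by_doi.items() if len(ids) > 1}


# Hypothetical sample data: two items share one DOI written in different forms.
sample = [
    ("Q1001", "10.1234/example.1"),
    ("Q1002", "https://doi.org/10.1234/EXAMPLE.1"),
    ("Q1003", "doi:10.1234/example.2"),
]
shared = find_shared_dois(sample)
```

Items that show up in the result need another identifier, such as the Publications Warehouse indexId, to tell them apart.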
Publication Year
While some catalogs associate other dates with a publication, we focus solely on the four-digit publication year used in citation metadata.
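Source catalogs express dates in many ways, so reducing them to a four-digit year usually takes a small parsing step. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the accepted range (1500 to 2099) is an arbitrary guard chosen for illustration, not a GeoKB rule.

```python
import re
from typing import Optional

# A standalone four-digit year between 1500 and 2099 (assumed range).
# The lookarounds reject years embedded in longer digit runs.
YEAR_PATTERN = re.compile(r"(?<![0-9])(1[5-9][0-9]{2}|20[0-9]{2})(?![0-9])")


def extract_publication_year(date_text: str) -> Optional[int]:
    """Return the first standalone four-digit year in date_text, or None."""
    match = YEAR_PATTERN.search(date_text)
    return int(match.group(1)) if match else None
```

For example, `extract_publication_year("2021-03-15")` and `extract_publication_year("March 2021")` both yield `2021`, while a value with no recognizable year yields `None`, so the year claim can be skipped without guessing.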
Contributors
Contributor links come in the form of authors, editors, and compilers. Each is always linked to a person entity that is also represented in the GeoKB. Links are made solely through confirmed ORCID identifiers in citation metadata. Even though this can leave out some contributors, we have to draw the line at linkages we can confirm, without guessing, between people and the publications they have contributed to.
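The ORCID-only rule can be sketched as a filter that keeps a contributor only when the ORCID is well formed, passes the ORCID checksum (ISO 7064 MOD 11-2), and matches a person already in the knowledge graph. The function names, the dictionary layout of the contributor records, and the `Q101` item ID are hypothetical; the GeoKB's own pipeline may work differently.

```python
import re

ORCID_FORMAT = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and the ISO 7064 MOD 11-2 check character of an ORCID."""
    if not ORCID_FORMAT.fullmatch(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return chars[15] == expected


def link_contributors(contributors, people_by_orcid):
    """Return (role, person_item_id) pairs for contributors we can confirm.

    contributors: list of dicts with "role" and an optional "orcid" key.
    people_by_orcid: mapping of ORCID -> person item ID (hypothetical lookup).
    Contributors with a missing, invalid, or unknown ORCID are skipped.
    """
    links = []
    for contributor in contributors:
        raw = contributor.get("orcid") or ""
        # Accept both bare ORCIDs and https://orcid.org/... URLs.
        orcid = raw.strip().rsplit("/", 1)[-1].upper()
        if is_valid_orcid(orcid) and orcid in people_by_orcid:
            links.append((contributor["role"], people_by_orcid[orcid]))
    return links


# 0000-0002-1825-0097 is the ORCID of the fictitious test researcher
# Josiah Carberry; "Q101" is a made-up item ID.
people = {"0000-0002-1825-0097": "Q101"}
records = [
    {"role": "author", "orcid": "https://orcid.org/0000-0002-1825-0097"},
    {"role": "author", "name": "No Identifier"},
    {"role": "editor", "orcid": "0000-0002-1825-0098"},  # bad check digit
]
links = link_contributors(records, people)
```

Name strings are deliberately never consulted, which mirrors the decision to leave out contributors that cannot be confirmed.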
Funders/Owners
We build linkages from publications to organizations via the funder and owner properties. Funders are mostly limited to official USGS Programs, because one of the major questions the GeoKB seeks to answer is what scientific products the line-item Programs in USGS support. Owners are the other organizational units in the USGS that were responsible for the production of a product and that see to its long-term maintenance and viability for use.
We will be extending the organization concept in the GeoKB beyond USGS organizations as we build out confirmed links to collaborator and stakeholder type organizations. This will include publications as we examine co-author affiliations and other extractions indicating collaborating institutions.
Content URLs
One thing we are working to clarify for publications is the linkage to underlying content. We established a content URL property to house links that we have confirmed lead to online content, leaving out other links that may be in metadata but no longer work. The main purpose is to support processes that retrieve content, complete with explicit context as to the significance or classification of the content, for use in NLP and image processing pipelines.
Additional text content
Another thing we are experimenting with is the use of the "item talk" wiki pages associated with items to house meaningful content directly associated with or from publication items. This content is sometimes not as accessible as it could be for NLP processes and other text analysis uses. By storing things like abstracts and tables of contents that we have in metadata directly with an item in a machine readable form (as transformable wiki markup), we can also work the wiki pages into machine learning pipelines.