People associated with other items in the GeoKB are represented using publicly available information from USGS Staff Profiles, ORCID records, and public catalogs that list them as authors or contributors to published works. These items carry claims for external identifiers, such as an ORCID iD, that link them to other items, and for email addresses where those are already part of the online public record. Some items represent people who are no longer affiliated with the USGS but remain linked to other information as part of the historical record. Transient identifiers such as email addresses or profile page links may no longer be valid; they are retained with a date qualifier indicating when they were known to be valid. The GeoKB is not a comprehensive set of all current and former USGS staff; it includes only staff who have chosen to place certain professional information in a public forum, for example by obtaining an ORCID identifier and making that information public as an author. We also include records for people with ORCID identifiers who are listed as co-authors on USGS publications but are not USGS staff. Some of these are close, long-term collaborators from partner academic institutions.
Caching raw data
The information we use to build representations of people comes from two online sources:
- USGS Personnel Profile pages (via a web scraping routine)
- ORCID records
We are working through these sources iteratively to determine what other information should be represented in the GeoKB, such as the types of expertise a person claims on their profile page. To support this process, we are experimenting with a further use of the Item Talk wiki pages: we run a data-gathering routine and write the resulting data structure, in YAML, to the Item Talk page as a cache for later processing. This gives us a contextual data store, meaning that a person's details pulled from other sources are stored alongside their representative item in the knowledgebase, from which we can build claims about the person.
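As a rough illustration of the cache structure described above, the following Python sketch assembles a record for one person. Only a "meta" section holding a retrieved date and a status code is described on this page; the `data` and `source` keys, the function names, and the `Item talk:` title format are assumptions made for illustration. Serializing the record to YAML and writing it to the wiki page are left out.

```python
from datetime import datetime, timezone


def talk_page_title(qid):
    """Return the talk page title for an item (assumed 'Item talk:' prefix)."""
    return f"Item talk:{qid}"


def build_cache_record(source_url, status_code, payload, retrieved=None):
    """Bundle raw source content with metadata about the retrieval.

    The 'meta' keys mirror the retrieved date and status code; the
    'source' and 'data' keys are hypothetical names for this sketch.
    """
    if retrieved is None:
        retrieved = datetime.now(timezone.utc)
    return {
        "meta": {
            "source": source_url,
            "retrieved": retrieved.date().isoformat(),
            "status_code": status_code,
        },
        # Keep the raw payload only when the source responded with 200.
        "data": payload if status_code == 200 else None,
    }
```

A record built this way could then be dumped with any YAML library before being saved to the page named by `talk_page_title`.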
Storing raw, yet-to-be-processed content in wiki pages has the added benefit of immediately adding that content to the overall search index of the Wikibase instance while we work through additional reference data mapping. The content can be searched directly using the search form or the MediaWiki API. While this is not as specific or as transitive across the entire knowledgebase as a SPARQL approach, it does open up additional use patterns that leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.
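A full-text search over the cached pages might be requested along these lines. This is a minimal sketch: the endpoint path and the Item Talk namespace number (121) are assumptions that should be checked against the instance, and `build_search_url` is a hypothetical helper. The query parameters themselves are the standard MediaWiki search parameters.

```python
from urllib.parse import urlencode

# Assumed API endpoint for the instance; verify before use.
API_URL = "https://geokb.wikibase.cloud/w/api.php"
# Assumed namespace number for Item Talk pages; verify in the namespace list.
ITEM_TALK_NAMESPACE = 121


def build_search_url(text, namespace=ITEM_TALK_NAMESPACE, limit=10):
    """Build a MediaWiki full-text search URL restricted to one namespace."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": text,
        "srnamespace": namespace,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Print the URL only; fetching it is left to the caller.
    print(build_search_url("volcano hazards"))
```

Building the URL separately from fetching it keeps the sketch usable with whichever HTTP client is already in place.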
Periodic Updates
The caching process is driven by the statements on a person's item that record their USGS profile URL (P31 "reference URL") and ORCID (P106). To help manage automated caching tasks, we record on those statements a retrieved (P139) date, specifying when we last retrieved information from the external resource, and a status code (P151), indicating whether the resource was available. This information is also in the "meta" section of the cached YAML structure, but we place it on the source claims as well so that we can query it with SPARQL such as the following.
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX ps: <https://geokb.wikibase.cloud/prop/statement/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?item ?itemLabel ?profile_url ?retrieved ?status_code
WHERE {
  ?item wdt:P1 wd:Q3 ;
        wdt:P31 ?profile_url ;
        p:P31 ?ref_url_statement .
  OPTIONAL {
    ?ref_url_statement pq:P151 ?status_code ;
                       pq:P139 ?retrieved .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
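One way the results of such a query could feed a periodic update is sketched below in Python. It takes a response in the standard SPARQL JSON results format and picks out the items whose cache looks stale. The 30-day threshold, the rule that any status other than 200 triggers a refresh, the assumption that the status code comes back as the plain string "200", and the function names are all illustrative choices, not the project's actual policy.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(binding, now, max_age_days=30):
    """Decide whether one result row points at a stale or failed retrieval."""
    retrieved = binding.get("retrieved")
    status = binding.get("status_code")
    # Rows without qualifiers have never been cached.
    if retrieved is None or status is None:
        return True
    # Assumed policy: retry anything that did not return 200 last time.
    if status["value"] != "200":
        return True
    when = datetime.fromisoformat(retrieved["value"].replace("Z", "+00:00"))
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return now - when > timedelta(days=max_age_days)


def stale_items(results, now=None, max_age_days=30):
    """Return item URIs from SPARQL JSON results that should be re-fetched."""
    now = now or datetime.now(timezone.utc)
    return [
        row["item"]["value"]
        for row in results["results"]["bindings"]
        if needs_refresh(row, now, max_age_days)
    ]
```

The returned URIs can then be handed to whatever routine re-scrapes the profile page or ORCID record and rewrites the cache.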