Item talk:Q44323
USGS Staff Profiles are a primary source of information on USGS employees, contractors, and volunteers. These are the public web pages that staff build and maintain for themselves. They contain basic identifying information, including name, email address, ORCID, organization affiliation, and job title; terms describing expertise and listings of educational and professional experience; and narrative bio content. This content is scraped routinely from the web pages, transformed into a schema.org content structure, cached as JSON-LD on person item wiki pages, and used to generate the following:
Derived Claims
- Name is used verbatim for the label as the presumed preference on the part of the individual
- Description is built as a combination of job title and organizational affiliation, intended as a quick source of orientation and disambiguation within the knowledge graph
- Entity is presumed to be instance of human (with some work to weed out "staff profiles" for non-humans)
- Employer is presumed to be USGS
- point in time qualifier is set based on the last time a profile was scraped and found to represent a current staff member
- end time qualifier may be set when it is determined that an employment status is no longer current (e.g., when a staff profile disappears from the inventory or is no longer accessible, or when staff profile content indicates "former employee"); this is generally set to the special "unknown value" as we have no specific public information on separation dates
- is affiliated with property is set to one or more organizations that a person shows as their primary or additional organizations; the linkage is based on the set of USGS organization items (subclasses of Item:Q50862:USGS organization also built from scraped web pages) and established via URL
- point in time and end time qualifiers are also used here
- affiliation claims are never removed once established, leaving a somewhat improved record on where people have been located over time
- occupation claims are established based on job titles (associated with primary or additional organizations) and a set of classification items built as subclass of human, using both primary label and aliases for exact term matches
- an additional job title connection is made for supervisor based on the presence of Supervisor/Supervisory terms in job titles
- occupation claims also leverage the point in time qualifier, indicating when they were last encountered, and are not removed once created
- email address claims record the email address listed for USGS personnel
- Point in time qualifiers are used to indicate when the email address was found and presumed to be viable
- previous address qualifiers may be included when email addresses are found to change or to appear in different forms over time for the same individual, as these can be used to establish linkages with some other content sources
- ORCID iD claims record ORCID identifiers
- ORCID claims may include retrieved, last update, and status code qualifiers indicating interactions with the ORCID registry/API. These indicate when an ORCID record was retrieved and cached, what the HTTP status code was on the request (providing an indication of where an ORCID may be invalid), and when information derived from the ORCID records was used.
- official website claims record the URL for the staff profile
- URL claims may include retrieved, last update, and status code qualifiers indicating interactions with the profile page itself. These indicate when a record was retrieved and cached, what the HTTP status code was on the request (with 404 or 403 indicating that a person has potentially left USGS service), and when information derived from the web page was used.
- The USGS web system handles the URL space for profile pages in a highly inconsistent and problematic way. Previous address claims can be used to record alternate forms of the URL for the same staff person, some of which may still be valid, with or without an HTTP redirect. There are cases where a URL once established for one unique individual later resolves to a page presenting information for someone completely different.
- knows about claims record links to items that are part of the GeoKB ontology describing topics of various kinds. These characteristics of a person are partially derived from the expertise terms listed in profile pages, where those terms can be matched either directly or with a high degree of confidence (via vector similarity) to terms in the GeoKB. Expertise terms account for a relatively small percentage of knows about claims: they are highly variable in structure and content and are not aligned with any specific vocabulary at the source, which leaves their meaning and intent ambiguous.
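As a rough illustration of the derivation rules listed above, the following Python sketch shows how a description, occupation matches, the supervisor flag, and the "possibly departed" signal might be computed. It is not the GeoKB code: the function names, the `OccupationItem` class, the comma separator in the description, and the case-insensitive comparison are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class OccupationItem:
    """Hypothetical stand-in for an occupation classification item."""
    qid: str
    label: str
    aliases: list[str] = field(default_factory=list)


def build_description(job_title: str, organization: str) -> str:
    """Combine job title and organization; the separator is an assumption."""
    parts = [p.strip() for p in (job_title, organization) if p and p.strip()]
    return ", ".join(parts)


def match_occupations(job_title: str, items: list[OccupationItem]) -> list[str]:
    """Return item ids whose label or an alias equals the job title.

    The match is exact on the whole term; ignoring case is an assumption.
    """
    title = job_title.strip().casefold()
    matches = []
    for item in items:
        names = {item.label.casefold()} | {a.casefold() for a in item.aliases}
        if title in names:
            matches.append(item.qid)
    return matches


def is_supervisor(job_title: str) -> bool:
    """True when the title contains Supervisor or Supervisory."""
    return "supervisor" in job_title.casefold()


def may_have_left(status_code: int) -> bool:
    """A 404 or 403 on the profile URL suggests a possible departure."""
    return status_code in (403, 404)
```

A title such as "Supervisory Hydrologist" would set the supervisor flag here but would only yield an occupation claim if that exact string were a label or alias on some classification item.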
Reference
The URL for the profile page is used as the reference URL for claims derived directly from the page content.
Source Cache
A derived form of the web page content is organized and encoded (JSON-LD) into a schema.org/Person document and stored in the wiki page for a person item under the "USGS Staff Profile" key. Other documents cached include "ORCID" and "OpenAlex." History is preserved for this cache, providing different versions of the staff profile content retrieved and processed over time.
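To make the cached structure more concrete, here is a minimal Python sketch of how scraped profile fields might be mapped to a schema.org/Person document in JSON-LD. The schema.org property names (`name`, `email`, `jobTitle`, `affiliation`, `knowsAbout`, `description`, `url`, `sameAs`) are real, but the choice of which ones the GeoKB actually uses, the input field names, and the `profile_to_person` function are assumptions for illustration.

```python
import json


def profile_to_person(profile: dict) -> dict:
    """Map a scraped profile (hypothetical field names) to schema.org/Person."""
    person = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": profile["name"],
        "url": profile["url"],
    }
    optional = {
        "email": profile.get("email"),
        "jobTitle": profile.get("job_title"),
        "knowsAbout": profile.get("expertise"),
        "description": profile.get("bio"),
    }
    if profile.get("organization"):
        optional["affiliation"] = {
            "@type": "Organization",
            "name": profile["organization"],
        }
    if profile.get("orcid"):
        # Assumption: the ORCID is expressed as a sameAs link.
        optional["sameAs"] = f"https://orcid.org/{profile['orcid']}"
    # Leave out anything the profile page did not provide.
    person.update({key: value for key, value in optional.items() if value})
    return person


def to_json_ld(person: dict) -> str:
    """Serialize the document for storage in the item's wiki page cache."""
    return json.dumps(person, indent=2, ensure_ascii=False)
```

Dropping empty fields keeps successive cached versions easy to compare, since a field appears only when the profile page actually supplied it.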
An intermediary source processing database is also maintained as part of the GeoKB architecture in a MongoDB document store. The staff_profiles collection contains the schema documents along with source HTML, status code check history, and other details used and created in processing.
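The shape of a staff_profiles document is not specified here, so the following Python sketch only illustrates one plausible way to keep a status code check history alongside a profile record. The field names (`status_history`, `last_status_code`, `checked`) and the `record_status_check` function are hypothetical, and the sketch works on a plain dictionary rather than a live MongoDB connection.

```python
from datetime import datetime, timezone
from typing import Optional


def record_status_check(
    doc: dict, status_code: int, checked: Optional[datetime] = None
) -> dict:
    """Append one HTTP status check to a profile document (in place)."""
    checked = checked or datetime.now(timezone.utc)
    doc.setdefault("status_history", []).append(
        {"checked": checked.isoformat(), "status_code": status_code}
    )
    # Keep the most recent result at the top level for quick filtering.
    doc["last_status_code"] = status_code
    return doc
```

Keeping the full history rather than only the latest code makes it possible to tell a transient failure from a profile that has been unreachable across several checks.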