Item talk:Q3: Difference between revisions

From geokb
No edit summary
 
(9 intermediate revisions by the same user not shown)
Line 1: Line 1:
A representation of people associated with other items in the GeoKB is created using publicly available information from the [https://www.usgs.gov/connect/staff-profiles USGS Staff Profiles], ORCiD records, and public catalogs where people are listed as authors/contributors to published works. The items contain claims for external identifiers like ORCiD ID used to link them with other items and email addresses when those are already part of the online public record. There are items in the GeoKB for people who are no longer affiliated with the USGS but are linked to other information as part of the historic record, and certain transient identifiers such as email or profile page links may no longer be valid but are retained with a date qualifier indicating when they were known to be valid. The GeoKB does not include a comprehensive set of all current/former USGS staff; it only includes those staff who have elected to have certain professional information in a public forum such as by obtaining an ORCiD identifier and making that information public as an author. We also include records for people with ORCID identifiers who are listed as co-authors on USGS publications but who are not USGS staff. Some of these are close, long-term collaborators from academic institution partners.
A representation of people associated with other items in the GeoKB is created using publicly available information from the [https://www.usgs.gov/connect/staff-profiles USGS Staff Profiles], ORCiD records, and any public catalogs where people are listed as authors/contributors to published works, have information associating them as a current or former USGS staff person, and can be resolved to a source such as OpenAlex. The items contain claims for external identifiers like ORCiD ID used to link them with other items and email addresses when those are already part of the online public record. There are items in the GeoKB for people who are no longer affiliated with the USGS but are linked to other information as part of the historic record, and certain transient identifiers such as email or profile page links may no longer be valid but are retained with a date qualifier indicating when they were known to be valid. The GeoKB does not include a comprehensive set of all current/former USGS staff; it only includes those staff who have elected to have certain professional information in a public forum such as by obtaining an ORCiD identifier and making that information public as an author.


= Caching raw data =
Person items receive an employer claim linking to the USGS. We attempt to indicate whether employees are current or not using an end time qualifier that generally uses the special "unknown" value type.
The information we are using to build representations of people comes from a couple of different online sources.
 
* USGS Personnel Profile pages (via a web scraping routine)
= Caching raw data as schema.org documents =
The information we are using to build representations of people comes from several sources.
 
* USGS Staff Profile pages (via a web scraping routine)
* ORCID records
* ORCID records
* OpenAlex records


We are iteratively working through these sources to determine what other information needs to be represented in the GeoKB such as the types of expertise a person claims to have on their profile pages. To help facilitate this process, we are experimenting with a further use of the Item Talk wiki pages where we run a data gathering algorithm and then write a data structure in YAML to the Item Talk pages as a cache to work through with other processing. This gives us a contextual data store (meaning that a person's details pulled from other sources are stored with their representative item in the knowledgebase) that we can pull from to build claims about the person.
In the case of USGS Staff Profiles, our primary source for personnel information in this knowledge graph, we have no programmatic or structured data access path and must use a web scraper to pull from pages periodically. In striving toward an ideal we'd like to see in future, we have started organizing all of the scraped content into notional [https://schema.org/Person schema.org/Person] documents. These are cached to the associated "item talk" pages for the person entity and then used from that state to set labels, descriptions, aliases, and claims.


Storing raw, yet-to-be-processed content in wiki pages has the added benefit of immediately adding content to the overall search index in the Wikibase instance while we hash through additional reference data mapping. This can be accessed directly using the search form or Mediawiki API. While this is not as specific and transitive across the entire knowledgebase as a SPARQL approach, it does open up possibilities for additional use patterns to leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.
In the case of ORCID, we have native JSON-LD Person documents already and are shifting to storing those from older methods that used the full ORCID JSON structure. ORCID records are pulled and cached for all person entities who have a recorded ORCID identifier.
 
OpenAlex "A" identifiers are recorded for person entities where these have been established through either an ORCID identifier from another source (e.g., Staff Profiles) or through another means (e.g., examining authorship on publications linked to OpenAlex works). Raw source OpenAlex information is stored in item talk pages in its native JSON structure.
 
Storing raw content, not all of which is processable into the knowledge graph, in wiki pages has the added benefit of immediately adding content to the full text search index in the Wikibase instance while we hash through additional reference data mapping. This can be accessed directly using the search form or Mediawiki API. While this is not as specific and transitive across the entire knowledgebase as a SPARQL approach, it does open up possibilities for additional use patterns to leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.


== Periodic Updates ==
== Periodic Updates ==
The caching process is managed through the statements for a person recording their USGS profile URL (P31 "reference URL") and ORCID (P106). To aid in managing automated caching tasks, we record a retrieved (P139) date specifying when we last retrieved information from the foreign resource and a status code (P151) indicating if the resource was available or not. This information is also in the "meta" section of the cached YAML structure, but we place it into the source claims so that we can query on it using something like the following SPARQL.
The caching process is managed through the statements for a person recording their USGS profile URL (P145 "official website"), ORCID (P106), and OpenAlex (P205). To aid in managing automated caching tasks, we record three details in qualifiers on the official website claims:
* last update (P129) date indicating when we last ran an HTTP operation on the URL
* status code (P151) indicating if the resource was available or not on the last update date
* retrieved (P139) date specifying when we last retrieved information from the official website source


<sparql tryit="1">
<sparql tryit="1">
PREFIX geokbe: <https://geokb.wikibase.cloud/entity/>
PREFIX geokbp: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX ps: <https://geokb.wikibase.cloud/prop/statement/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>


SELECT ?item ?itemLabel ?retrieved
SELECT ?item ?url ?last_update ?retrieved ?status_code
WHERE {
WHERE {
   ?item p:P31 ?ref_url_statement .
   ?item geokbp:P1 geokbe:Q3 ;
   ?ref_url_statement pq:P151 ?status_code ;
        geokbp:P145 ?url ;
                    pq:P139 ?retrieved .
        p:P145 ?url_statement .
   FILTER (STR(?status_code) = '200')
   OPTIONAL {
   FILTER (xsd:dateTime(?retrieved) > xsd:dateTime("2020-01-01T00:00:00Z"))
    ?url_statement pq:P129 ?last_update .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  }
  OPTIONAL {
    ?url_statement pq:P139 ?retrieved .
   }
  OPTIONAL {
    ?url_statement pq:P151 ?status_code .
  }
   FILTER(regex(str(?url), "staff-profiles"))
}
}
</sparql>
</sparql>
== Update Philosophy ==
We take an additive approach on linked information content. Once a claim/statement about a person has been set, the normal operation is to retain that information for all time unless it is explicitly refuted somehow. We use the point in time qualifier to indicate that something was assumed to be true on that date. So, if a person's staff profile indicates an assertion that they know about a topic based on linking expertise terms (encoded as schema.org/knowsAbout) or working through other content, we record the date we developed that assertion and assume it to be the case. If, at a later time, a person has changed their expertise terms, we keep anything that was there previously, update point in time qualifiers, add new linkages, but retain anything that disappeared. There is no concept of "no longer knows about."
== Previous Address ==
One of the challenges in working with the staff profiles content comes from how the URL space is being managed. URLs for the same person change and show up over time in inconsistent ways. Sometimes an older URL will get a 301 redirect to a new URL, but other times, both an old and new URL will still return 200 responses. In other cases, old URLs disappear entirely (404 responses), and in other cases these return a 403 "forbidden" response, indicating that something is probably still there but inaccessible anonymously.
We have had to deal with this issue in a variety of ways, not all of which can be automated (yet). After initially recording multiple official website claims, we ended up remodeling the content to record only a single official website for a person and then using a "previous address" qualifier to record additional URLs that may or may not be valid for a person. When a URL stops returning a functional response (200 or 301), we end up needing to investigate at some point. The retrieved date is an indicator of when a person may have separated from the USGS for some reason, and then we have internal information about this that can be consulted.
== Schema.org Person profile ==
We had to make some judgment calls in building the "notional" schema.org documents for person entities that should be revisited when USGS (hopefully) makes this part of the staff profile system architecture. The following is an overview of the major decisions and nuances in the approach:
* No significant processing is done between what is scraped from the web pages and what goes into the schema.org JSON-LD encoded documents. We do a little bit of text cleanup and parsing, but everything is left as it is presented.
* Email and ORCID identifiers are the two major distinguishing identifiers pulled in through this process. We know from experience that neither of these can be absolutely counted on as being correct. There have been cases of the same ORCID showing up for multiple people, only one of which is correct. Other cases have included invalid ORCID identifiers. We record email in the [https://schema.org/email email property]. ORCID identifiers are stored as an [https://schema.org/identifier identifier] using the [https://schema.org/PropertyValue PropertyValue] syntax with "ORCID" used as [https://schema.org/propertyID propertyID].
* We also record the GeoKB QID identifier in URL form as an identifier, giving us the linkage between the profile schema doc and the entity representation in the knowledge graph.
* We encode the basic relationship of a person with the USGS using [https://schema.org/memberOf memberOf]. The presumption is that each person with a USGS Staff Profile was considered a member of the USGS at some point in time. We record the date that a person was still present with a staff profile in the startDate property of the memberOf @OrganizationalRole document as a convention to include a date. This is nominally inaccurate in that it is not actually the start date for a person's role. We translate this into a point in time qualifier in the GeoKB, indicating that at least at the recorded point in time, someone was a member of the USGS. The whole issue of date range of employment is something that could be much improved by incorporating this information legitimately into the Staff Profile system. On the other end, we end up assuming that someone is no longer a member of the USGS when their Staff Profile page "goes missing" or we pick up a clue in some (but not all) more recent Staff Profiles when the maintainers delete all of what someone had in their profile and change their title to add "(former employee)."
* Name, url, and jobTitle are taken verbatim with only basic text cleanup to remove leading/trailing spaces and transform encoding of special characters.
= Analytical Uses =
One of the things that bringing our many disparate (meta)data sources together in graph form supports is network analysis in various ways. One of these areas of interest is in our human networks of people working across organizational boundaries. We can see this show up in various ways by examining where people intersect in co-authoring papers, working on projects together, and potentially where we are working on some of the same things but perhaps not collaborating as much as we could be.
== Queries ==
The following is a basic query pattern that can be used to find cases where staff from different Science Centers have collaborated on products. The query generates a graph with labels for the intersection of people co-authoring creative works within a recent time range. It uses query criteria and filters to identify cases where people are working together across organizational units. This particular case examines co-authors who are affiliated with one of the organizational units in the Midcontinent Region.
<sparql tryit="1">
PREFIX ge: <https://geokb.wikibase.cloud/entity/>
PREFIX gp: <https://geokb.wikibase.cloud/prop/direct/>
SELECT ?author1 ?organization1 ?article ?author2 ?organization2
?author1Label ?organization1Label ?articleLabel ?author2Label ?organization2Label
WHERE {
  ?article gp:P102 ?author1, ?author2 . # Get articles authored by multiple people
  ?author1 gp:P108 ?organization1 . # Get author 1's organizational affiliation
  ?author2 gp:P108 ?organization2 . # Get author 2's organizational affiliation
  ?organization1 gp:P190* ge:Q44363 . # Restrict author 1's organization to subsidiaries of the Midcontinent Region
  ?organization2 gp:P190* ge:Q44363 . # Restrict author 2's organization to subsidiaries of the Midcontinent Region
  ?article gp:P7 ?publication_date . # Get the publication date for the articles
  FILTER (?author1 != ?author2 && ?organization1 != ?organization2 && YEAR(?publication_date) > 2017) # Keep pubs from 2018 onward and tease out authors in different organizations
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
</sparql>
An important note here is that the affiliation information comes mostly from the latest affiliation we know about for a person from a personnel organization standpoint. The linkage comes from the stated organization(s) that a person is a part of from scraping their staff profiles. This is not necessarily fully current and does not reflect the full history of affiliations that a person has had. It is also not based on the stated affiliation for authors at the time of publishing an article or report.
The knowledgebase model we are working with here supports the dynamic of recording many affiliations for a given person with the potential for robust time bounds if that information could be surfaced. But we have challenges in information quality and completeness that have not been overcome. Affiliations for authors are sometimes recorded in the Publications Warehouse but these are essentially name-only identifiers for organizational units, not all of which have been disambiguated and tied to a usable identifier of some kind. What we would want to be able to do in an ideal world for something like this analysis is look specifically at the affiliation of co-authors at the time of co-authoring a publication, but we need more sophistication in our underlying characterization to do so comprehensively.

Latest revision as of 19:17, 20 August 2024

A representation of people associated with other items in the GeoKB is created using publicly available information from the USGS Staff Profiles, ORCID records, and any public catalogs in which people are listed as authors or contributors to published works, are identified as current or former USGS staff, and can be resolved to a source such as OpenAlex. The items contain claims for external identifiers, such as the ORCID iD, that link them with other items, and for email addresses when those are already part of the online public record. There are items in the GeoKB for people who are no longer affiliated with the USGS but are linked to other information as part of the historic record. Certain transient identifiers, such as email addresses or profile page links, may no longer be valid but are retained with a date qualifier indicating when they were known to be valid. The GeoKB does not include a comprehensive set of all current and former USGS staff; it includes only those staff who have elected to place certain professional information in a public forum, such as by obtaining an ORCID identifier and making that information public as an author.

Person items receive an employer claim linking to the USGS. We attempt to indicate whether employees are current or not using an end time qualifier that generally uses the special "unknown" value type.

Caching raw data as schema.org documents

The information we are using to build representations of people comes from several sources.

  • USGS Staff Profile pages (via a web scraping routine)
  • ORCID records
  • OpenAlex records

In the case of USGS Staff Profiles, our primary source for personnel information in this knowledge graph, we have no programmatic or structured data access path and must use a web scraper to pull from pages periodically. In striving toward an ideal we'd like to see in future, we have started organizing all of the scraped content into notional schema.org/Person documents. These are cached to the associated "item talk" pages for the person entity and then used from that state to set labels, descriptions, aliases, and claims.

In the case of ORCID, we already have native JSON-LD Person documents and are shifting to storing those in place of the full ORCID JSON structure used by older methods. ORCID records are pulled and cached for all person entities that have a recorded ORCID identifier.

OpenAlex "A" identifiers are recorded for person entities where these have been established through either an ORCID identifier from another source (e.g., Staff Profiles) or through another means (e.g., examining authorship on publications linked to OpenAlex works). Raw source OpenAlex information is stored in item talk pages in its native JSON structure.

Storing raw content in wiki pages, not all of which can be processed into the knowledge graph, has the added benefit of immediately adding that content to the full-text search index of the Wikibase instance while we work through additional reference data mapping. The content can be accessed directly using the search form or the MediaWiki API. While this is not as specific and transitive across the entire knowledgebase as a SPARQL approach, it does open up additional use patterns that leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.
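As a rough illustration of the MediaWiki API route, the sketch below only builds a full-text search URL and does not send a request. The endpoint path (`/w/api.php`) and the item talk namespace number (121) are assumptions based on common Wikibase defaults, and `build_search_url` is a hypothetical helper; both values should be checked against the live instance.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed values, not confirmed by this page: verify against the live instance.
API_URL = "https://geokb.wikibase.cloud/w/api.php"
ITEM_TALK_NAMESPACE = 121  # common Wikibase default for "Item talk"


def build_search_url(text, limit=10):
    """Return a MediaWiki full-text search URL restricted to item talk pages."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": text,
        "srnamespace": ITEM_TALK_NAMESPACE,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


# Hypothetical search phrase, used only for illustration.
print(build_search_url("groundwater modeling"))
```

The returned URL can be fetched with any HTTP client; each search hit names an item talk page, whose title carries the QID of the associated person item.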

Periodic Updates

The caching process is managed through the statements for a person recording their USGS profile URL (P145 "official website"), ORCID (P106), and OpenAlex (P205). To aid in managing automated caching tasks, we record three details in qualifiers on the official website claims:

  • last update (P129) date indicating when we last ran an HTTP operation on the URL
  • status code (P151) indicating if the resource was available or not on the last update date
  • retrieved (P139) date specifying when we last retrieved information from the official website source

PREFIX geokbe: <https://geokb.wikibase.cloud/entity/>
PREFIX geokbp: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?item ?url ?last_update ?retrieved ?status_code 
WHERE {
  ?item geokbp:P1 geokbe:Q3 ;
        geokbp:P145 ?url ;
        p:P145 ?url_statement .
  OPTIONAL {
    ?url_statement pq:P129 ?last_update .
  }
  OPTIONAL {
    ?url_statement pq:P139 ?retrieved .
  }
  OPTIONAL {
    ?url_statement pq:P151 ?status_code .
  }
  FILTER(regex(str(?url), "staff-profiles"))
}


Update Philosophy

We take an additive approach on linked information content. Once a claim/statement about a person has been set, the normal operation is to retain that information for all time unless it is explicitly refuted somehow. We use the point in time qualifier to indicate that something was assumed to be true on that date. So, if a person's staff profile indicates an assertion that they know about a topic based on linking expertise terms (encoded as schema.org/knowsAbout) or working through other content, we record the date we developed that assertion and assume it to be the case. If, at a later time, a person has changed their expertise terms, we keep anything that was there previously, update point in time qualifiers, add new linkages, but retain anything that disappeared. There is no concept of "no longer knows about."
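The additive behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual GeoKB processing code: the function name, the dictionary layout (term mapped to its point in time date), and the sample terms are all hypothetical.

```python
from datetime import date


def merge_knows_about(existing, current_terms, observed_on):
    """Additively merge expertise terms (hypothetical helper).

    existing:      dict mapping term -> point in time date last asserted
    current_terms: terms found on the profile in the latest scrape
    observed_on:   date of the latest scrape
    """
    merged = dict(existing)          # start from everything already recorded
    for term in current_terms:
        merged[term] = observed_on   # add new terms, refresh dates on current ones
    return merged                    # terms that disappeared keep their old date


# Hypothetical example data.
previous = {"hydrology": date(2023, 1, 15), "geochemistry": date(2023, 1, 15)}
latest = merge_knows_about(previous, ["hydrology", "remote sensing"], date(2024, 8, 20))
print(latest)
```

In this sketch a term that has dropped off the profile ("geochemistry") is never deleted; it simply stops having its point in time refreshed.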

Previous Address

One of the challenges in working with the staff profiles content comes from how the URL space is being managed. URLs for the same person change and show up over time in inconsistent ways. Sometimes an older URL will get a 301 redirect to a new URL, but other times, both an old and new URL will still return 200 responses. In other cases, old URLs disappear entirely (404 responses), and in other cases these return a 403 "forbidden" response, indicating that something is probably still there but inaccessible anonymously.

We have had to deal with this issue in a variety of ways, not all of which can be automated (yet). After initially recording multiple official website claims, we remodeled the content to record only a single official website for a person and to use a "previous address" qualifier for additional URLs that may or may not be valid for that person. When a URL stops returning a functional response (200 or 301), we need to investigate at some point. The retrieved date then indicates roughly when a person may have separated from the USGS, and we have internal information that can be consulted to confirm this.
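The status code handling described in this section could be expressed along these lines. This is a sketch under assumptions: the category labels and function names are hypothetical, and only the four status codes mentioned above are distinguished.

```python
def classify_profile_status(status_code):
    """Map an HTTP status code for a profile URL to a coarse category.

    The category labels are illustrative, not GeoKB vocabulary.
    """
    if status_code in (200, 301):
        return "functional"   # page is served, or redirects to a new URL
    if status_code == 403:
        return "forbidden"    # probably still there but inaccessible anonymously
    if status_code == 404:
        return "missing"      # old URL has disappeared entirely
    return "other"            # anything else is left for manual review


def needs_investigation(status_code):
    """A URL needs a closer look once it stops returning a functional response."""
    return classify_profile_status(status_code) != "functional"


for code in (200, 301, 403, 404, 500):
    print(code, classify_profile_status(code), needs_investigation(code))
```

A 200 response alone does not settle which URL is canonical, since an old and a new URL for the same person can both return 200; that case still requires the "previous address" modeling described above.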

Schema.org Person profile

We had to make some judgment calls in building the "notional" schema.org documents for person entities that should be revisited when USGS (hopefully) makes this part of the staff profile system architecture. The following is an overview of the major decisions and nuances in the approach:

  • No significant processing is done between what is scraped from the web pages and what goes into the schema.org JSON-LD encoded documents. We do a little bit of text cleanup and parsing, but everything is left as it is presented.
  • Email and ORCID identifiers are the two major distinguishing identifiers pulled in through this process. We know from experience that neither of these can be absolutely counted on as being correct. There have been cases of the same ORCID showing up for multiple people, only one of which is correct. Other cases have included invalid ORCID identifiers. We record email in the email property. ORCID identifiers are stored as an identifier using the PropertyValue syntax with "ORCID" used as propertyID.
  • We also record the GeoKB QID identifier in URL form as an identifier, giving us the linkage between the profile schema doc and the entity representation in the knowledge graph.
  • We encode the basic relationship of a person with the USGS using memberOf. The presumption is that each person with a USGS Staff Profile was considered a member of the USGS at some point in time. We record the date that a person was still present with a staff profile in the startDate property of the memberOf @OrganizationalRole document as a convention to include a date. This is nominally inaccurate in that it is not actually the start date for a person's role. We translate this into a point in time qualifier in the GeoKB, indicating that at least at the recorded point in time, someone was a member of the USGS. The whole issue of date range of employment is something that could be much improved by incorporating this information legitimately into the Staff Profile system. On the other end, we end up assuming that someone is no longer a member of the USGS when their Staff Profile page "goes missing" or we pick up a clue in some (but not all) more recent Staff Profiles when the maintainers delete all of what someone had in their profile and change their title to add "(former employee)."
  • Name, url, and jobTitle are taken verbatim with only basic text cleanup to remove leading/trailing spaces and transform encoding of special characters.
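To make the decisions above concrete, the following sketch assembles a notional schema.org/Person document from scraped values. It is an illustration only: the function and parameter names are hypothetical, the exact shape of the GeoKB identifier entry (a plain entity URL here) is an assumption, and the person details are placeholders, including the ORCID value, which is the sample identifier used in ORCID's own documentation.

```python
import json

# Entity URL base taken from the SPARQL prefixes used on this page.
GEOKB_ENTITY_BASE = "https://geokb.wikibase.cloud/entity/"


def build_person_doc(name, url, job_title, geokb_qid, seen_on, email=None, orcid=None):
    """Assemble a notional schema.org/Person document (hypothetical helper)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Person",
        # name, url, and jobTitle are kept verbatim apart from whitespace cleanup
        "name": name.strip(),
        "url": url.strip(),
        "jobTitle": job_title.strip(),
        # GeoKB QID in URL form links the document to the graph entity (assumed shape)
        "identifier": [GEOKB_ENTITY_BASE + geokb_qid],
        "memberOf": {
            "@type": "OrganizationalRole",
            "memberOf": {"@type": "Organization", "name": "U.S. Geological Survey"},
            # By convention this is the date the profile was seen, not a true start date
            "startDate": seen_on,
        },
    }
    if email:
        doc["email"] = email.strip()
    if orcid:
        doc["identifier"].append(
            {"@type": "PropertyValue", "propertyID": "ORCID", "value": orcid}
        )
    return doc


# Placeholder values for illustration only.
example = build_person_doc(
    name=" Jane Doe ",
    url="https://www.usgs.gov/staff-profiles/jane-doe",
    job_title="Hydrologist",
    geokb_qid="Q12345",
    seen_on="2024-08-20",
    email="jdoe@example.gov",
    orcid="0000-0002-1825-0097",
)
print(json.dumps(example, indent=2))
```

Keeping the scraped values essentially untouched in this document means any interpretation, such as translating `startDate` into a point in time qualifier, happens in a later step where it can be revisited.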

Analytical Uses

Bringing our many disparate (meta)data sources together in graph form supports network analysis in various ways. One area of interest is our human networks of people working across organizational boundaries. We can see these by examining where people intersect in co-authoring papers and working on projects together, and potentially where we are working on some of the same things but perhaps not collaborating as much as we could be.

Queries

The following is a basic query pattern that can be used to find cases where staff from different Science Centers have collaborated on products. The query generates a graph with labels for the intersection of people co-authoring creative works within a recent time range. It uses query criteria and filters to identify cases where people are working together across organizational units. This particular case examines co-authors who are affiliated with one of the organizational units in the Midcontinent Region.

PREFIX ge: <https://geokb.wikibase.cloud/entity/>
PREFIX gp: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?author1 ?organization1 ?article ?author2 ?organization2
?author1Label ?organization1Label ?articleLabel ?author2Label ?organization2Label
WHERE {
  ?article gp:P102 ?author1, ?author2 . # Get articles authored by multiple people
  ?author1 gp:P108 ?organization1 . # Get author 1's organizational affiliation
  ?author2 gp:P108 ?organization2 . # Get author 2's organizational affiliation
  ?organization1 gp:P190* ge:Q44363 . # Restrict author 1's organization to subsidiaries of the Midcontinent Region
  ?organization2 gp:P190* ge:Q44363 . # Restrict author 2's organization to subsidiaries of the Midcontinent Region
  ?article gp:P7 ?publication_date . # Get the publication date for the articles
  FILTER (?author1 != ?author2 && ?organization1 != ?organization2 && YEAR(?publication_date) > 2017) # Keep pubs from 2018 onward and tease out authors in different organizations
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}


An important note here is that the affiliation information comes mostly from the latest affiliation we know about for a person from a personnel organization standpoint. The linkage comes from the stated organization(s) that a person is a part of from scraping their staff profiles. This is not necessarily fully current and does not reflect the full history of affiliations that a person has had. It is also not based on the stated affiliation for authors at the time of publishing an article or report.

The knowledgebase model we are working with here supports recording many affiliations for a given person, with the potential for robust time bounds if that information could be surfaced. However, we have challenges in information quality and completeness that have not been overcome. Affiliations for authors are sometimes recorded in the Publications Warehouse, but these are essentially name-only identifiers for organizational units, not all of which have been disambiguated and tied to a usable identifier of some kind. Ideally, this kind of analysis would look specifically at the affiliation of co-authors at the time of co-authoring a publication, but we need more sophistication in our underlying characterization to do so comprehensively.