Item talk:Q3

Revision as of 16:30, 28 February 2024 by Sky

People associated with other items in the GeoKB are represented using publicly available information from USGS Staff Profiles, ORCID records, and public catalogs where people are listed as authors or contributors to published works. The items contain claims for external identifiers, such as the ORCID identifier, that link them with other items, and for email addresses when those are already part of the online public record. Some items represent people who are no longer affiliated with the USGS but are linked to other information as part of the historical record. Certain transient identifiers, such as email addresses or profile page links, may no longer be valid; they are retained with a date qualifier indicating when they were known to be valid. The GeoKB is not a comprehensive set of all current and former USGS staff: it includes only those who have chosen to place certain professional information in a public forum, for example by obtaining an ORCID identifier and making that information public as an author. We also include records for people with ORCID identifiers who are listed as co-authors on USGS publications but who are not USGS staff. Some of these are close, long-term collaborators at partner academic institutions.

Caching Raw Data

The information we use to build representations of people comes from two online sources:

  • USGS Personnel Profile pages (via a web scraping routine)
  • ORCID records

We are iteratively working through these sources to determine what other information should be represented in the GeoKB, such as the types of expertise a person claims on their profile page. To facilitate this, we are experimenting with a further use of the Item Talk wiki pages: a data-gathering routine writes a YAML data structure to an item's talk page, where it serves as a cache for later processing. This gives us a contextual data store (a person's details pulled from other sources are stored alongside their representative item in the knowledgebase) from which we can build claims about the person.
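
As a rough illustration of reading such a cache back, the sketch below builds a MediaWiki API request for an item's talk page and pulls the raw page text out of the response. The API endpoint path and the item identifier used in the usage note are assumptions for illustration, not details confirmed by this page, and the function names are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the instance serves the MediaWiki Action API at this path.
API_URL = "https://geokb.wikibase.cloud/w/api.php"


def talk_page_request_url(qid: str) -> str:
    """Build an API URL asking for the latest text of an Item talk page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": f"Item talk:{qid}",
        "format": "json",
        "formatversion": "2",
    }
    return f"{API_URL}?{urlencode(params)}"


def extract_page_text(api_response: dict) -> Optional[str]:
    """Return the text of the first page in the response, or None if absent."""
    pages = api_response.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing"):
        return None
    revisions = pages[0].get("revisions", [])
    if not revisions:
        return None
    return revisions[0]["slots"]["main"]["content"]


def fetch_cached_text(qid: str) -> Optional[str]:
    """Fetch the raw cached text for one item (performs a network request)."""
    with urlopen(talk_page_request_url(qid)) as response:
        return extract_page_text(json.load(response))
```

Calling `fetch_cached_text("Q12345")` (a made-up identifier) would return the raw YAML text, which could then be handed to a YAML parser such as PyYAML's `yaml.safe_load`.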

Storing raw, not-yet-processed content in wiki pages has the added benefit of immediately adding that content to the search index of the Wikibase instance while we work through additional reference data mapping. The content can be accessed directly through the search form or the MediaWiki API. While this is not as specific, or as transitive across the entire knowledgebase, as a SPARQL approach, it opens up additional use patterns that leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.
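
A full-text search over the cached talk pages might look something like the following sketch. The API endpoint path and the namespace number for Item talk pages (121, the usual Wikibase default) are assumptions that should be checked against the instance, and the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumptions: API path and Item talk namespace number for this instance.
API_URL = "https://geokb.wikibase.cloud/w/api.php"
ITEM_TALK_NAMESPACE = 121


def search_request_url(text: str, limit: int = 20) -> str:
    """Build an API URL for a full-text search restricted to Item talk pages."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": text,
        "srnamespace": ITEM_TALK_NAMESPACE,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def matching_titles(api_response: dict) -> list:
    """Return the page titles from a search response, in the order given."""
    hits = api_response.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits]
```

Each returned title has the form `Item talk:Q…`, so the corresponding item identifier can be recovered by taking the text after the colon.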

Periodic Updates

The caching process is managed through the statements on a person's item that record their USGS profile URL (P31, "reference URL") and ORCID (P106). To help manage automated caching tasks, we add two qualifiers to these statements: a retrieved (P139) date recording when we last retrieved information from the external resource, and a status code (P151) indicating whether the resource was available. This information is also in the "meta" section of the cached YAML structure, but we place it on the source claims so that we can query it with something like the following SPARQL.

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX ps: <https://geokb.wikibase.cloud/prop/statement/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?item ?itemLabel ?profile_url ?retrieved ?status_code
WHERE {
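  ?item wdt:P1 wd:Q3 ;                      # items linked to Q3 (person) via P1
        p:P31 ?ref_url_statement .          # statement node for the reference URL
  ?ref_url_statement ps:P31 ?profile_url .  # URL from that same statement, so qualifiers stay paired with it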
  OPTIONAL {
    ?ref_url_statement pq:P151 ?status_code ;
                       pq:P139 ?retrieved .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
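
Rows from a query like this could drive the decision about which profiles to fetch again. The sketch below is one possible policy, not the project's actual one: the 30-day maximum age and the choice to retry anything without a 200 status are assumptions, and `needs_refresh` is a hypothetical helper.

```python
from datetime import date, timedelta
from typing import Optional


def needs_refresh(
    retrieved: Optional[date],
    status_code: Optional[int],
    today: date,
    max_age_days: int = 30,  # assumed refresh interval
) -> bool:
    """Decide whether a cached source should be retrieved again."""
    # No qualifiers yet (the OPTIONAL block matched nothing): never retrieved.
    if retrieved is None or status_code is None:
        return True
    # The last attempt did not succeed, so try again.
    if status_code != 200:
        return True
    # Otherwise refresh only when the cache is older than the allowed age.
    return today - retrieved > timedelta(days=max_age_days)
```

Passing `today` explicitly keeps the function deterministic, which makes the policy easy to test against fixed dates.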

Analytical Uses

Bringing our many disparate (meta)data sources together in graph form supports network analysis of various kinds. One area of interest is our human networks: people working across organizational boundaries. These networks show up where people intersect in co-authoring papers or working on projects together, and potentially where we are working on some of the same things but perhaps not collaborating as much as we could be.

Queries

The following is a basic query pattern that can be used to find cases where staff from different Science Centers have collaborated on products. The query generates a graph with labels for the intersection of people co-authoring creative works within a recent time range. It uses query criteria and filters to identify cases where people are working together across organizational units. This particular case examines co-authors who are affiliated with one of the organizational units in the Midcontinent Region.

PREFIX ge: <https://geokb.wikibase.cloud/entity/>
PREFIX gp: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?author1 ?organization1 ?article ?author2 ?organization2
?author1Label ?organization1Label ?articleLabel ?author2Label ?organization2Label
WHERE {
  ?article gp:P102 ?author1, ?author2 . # Get articles authored by multiple people
  ?author1 gp:P108 ?organization1 . # Get author 1's organizational affiliation
  ?author2 gp:P108 ?organization2 . # Get author 2's organizational affiliation
  ?organization1 gp:P190* ge:Q44363 . # Restrict author 1's organization to subsidiaries of the Midcontinent Region
  ?organization2 gp:P190* ge:Q44363 . # Restrict author 2's organization to subsidiaries of the Midcontinent Region
  ?article gp:P7 ?publication_date . # Get the publication date for the articles
  FILTER (?author1 != ?author2 && ?organization1 != ?organization2 && YEAR(?publication_date) > 2017) # Keep pubs from 2018 onward and author pairs in different organizations
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
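
Because the filter only requires the two authors to differ, each collaborating pair comes back twice, once in each order. A minimal post-processing sketch that collapses those mirrored rows and counts distinct articles per pair of organizations is shown below. It assumes the results have been flattened into dictionaries keyed by the SELECT variable names; the function name and the sample values in the usage note are hypothetical.

```python
from collections import defaultdict


def count_cross_org_articles(rows):
    """Count distinct articles shared by each unordered pair of organizations."""
    articles_by_pair = defaultdict(set)
    for row in rows:
        # Sort the two organizations so (A, B) and (B, A) land on the same key.
        pair = tuple(sorted((row["organization1"], row["organization2"])))
        # A set ignores repeated rows for the same article.
        articles_by_pair[pair].add(row["article"])
    return {pair: len(articles) for pair, articles in articles_by_pair.items()}
```

For example, two mirrored rows for one article shared by "Center A" and "Center B" produce a single entry with a count of 1. The resulting counts can serve as edge weights in a collaboration graph.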

An important caveat: the affiliation information mostly reflects the latest affiliation we know about for a person from a personnel organization standpoint. The linkage comes from the organization(s) stated on a person's staff profile, obtained by scraping. It is not necessarily current and does not reflect the full history of a person's affiliations. Nor is it based on the affiliation stated for each author at the time an article or report was published.

The knowledgebase model supports recording many affiliations for a given person, with the potential for robust time bounds if that information can be surfaced. However, we have challenges in information quality and completeness that have not yet been overcome. Affiliations for authors are sometimes recorded in the Publications Warehouse, but these are essentially name-only identifiers for organizational units, not all of which have been disambiguated and tied to a usable identifier. Ideally, this analysis would look specifically at the affiliation of co-authors at the time they co-authored a publication, but we need a more sophisticated underlying characterization to do so comprehensively.
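
If time-bounded affiliations were available, selecting the affiliation that applied on a publication date could be as simple as the sketch below. The `Affiliation` structure, its field names, and the treatment of missing bounds as open-ended are all assumptions made for illustration; they do not describe the current GeoKB model.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Affiliation:
    """Hypothetical time-bounded affiliation record."""
    organization: str
    start: Optional[date] = None  # None: start unknown, treated as open
    end: Optional[date] = None    # None: end unknown or still current


def affiliations_on(affiliations: List[Affiliation], when: date) -> List[str]:
    """Return the organizations whose time bounds include the given date."""
    return [
        a.organization
        for a in affiliations
        if (a.start is None or a.start <= when)
        and (a.end is None or when <= a.end)
    ]
```

Note that a record with no bounds matches every date, which reproduces today's situation where only the latest known affiliation is available.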