Item talk:Q3

Revision as of 11:04, 12 May 2024

Items representing people associated with other items in the GeoKB are built from publicly available information: USGS Staff Profiles, ORCID records, and public catalogs where people are listed as authors or contributors to published works. The items contain claims for external identifiers such as the ORCID iD, which are used to link them with other items, and for email addresses when those are already part of the online public record. Some items represent people who are no longer affiliated with the USGS but remain linked to other information as part of the historic record. For these, transient identifiers such as email addresses or profile page links may no longer be valid; they are retained with a date qualifier indicating when they were known to be valid. The GeoKB is not a comprehensive set of all current and former USGS staff. It only includes staff who have elected to place certain professional information in a public forum, such as by obtaining an ORCID iD and making that information public as an author. We also include records for people with ORCID identifiers who are listed as co-authors on USGS publications but who are not USGS staff. Some of these are close, long-term collaborators from partner academic institutions.

Caching raw data

The information we are using to build representations of people comes from a couple of different online sources.

  • USGS Staff Profile pages (via a web scraping routine)
  • ORCID records

We are working through these sources iteratively to determine what other information needs to be represented in the GeoKB, such as the types of expertise a person claims on their profile page. To facilitate this, we are experimenting with a further use of the Item Talk wiki pages: we run a data gathering algorithm and write the resulting data structure, in YAML, to the person's Item Talk page as a cache for later processing. This gives us a contextual data store (a person's details pulled from other sources are stored alongside their representative item in the knowledgebase) that we can draw on to build claims about the person.

Storing raw, yet-to-be-processed content in wiki pages has the added benefit of immediately adding that content to the overall search index of the Wikibase instance while we work through additional reference data mapping. The content can be accessed directly through the search form or the MediaWiki API. While this is not as specific and transitive across the entire knowledgebase as a SPARQL approach, it opens up additional use patterns that leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.
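
As a rough illustration of the API route, the sketch below builds a MediaWiki full-text search request restricted to one namespace. The endpoint URL and the namespace number are assumptions (neither is stated here), so check both against the actual instance before relying on them.

```python
from urllib.parse import urlencode

# Assumed API endpoint for the GeoKB Wikibase instance (not stated in the text).
API_URL = "https://geokb.wikibase.cloud/w/api.php"


def build_search_url(term, namespace=121, limit=10):
    """Build a MediaWiki full-text search request URL.

    namespace=121 is an assumption: it is a common number for the
    "Item talk" namespace in Wikibase installs, but it may differ here.
    """
    params = {
        "action": "query",      # standard MediaWiki query module
        "list": "search",       # full-text search submodule
        "srsearch": term,       # the search string
        "srnamespace": namespace,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


print(build_search_url("rare earth"))
```

The returned URL can be fetched with any HTTP client; matching pages are listed under `query.search` in the JSON response.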

For USGS Staff Profile pages, we transform scraped content into the schema.org Person type along with other applicable types. JSON-LD exposed directly through personal landing pages is a "hoped for" feature that we would like to see develop eventually. The content is not linked to anything at the time of harvest, so we embed name-only values in the JSON-LD rendering (stored as YAML on Item Talk pages). A secondary process then develops linkages to known entities in the graph when we process claims.
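
A minimal sketch of what such a transformation might look like follows. The input field names (`name`, `profile_url`, `title`, `organization`, `expertise`) and the example values are hypothetical, not the scraper's actual output; the schema.org properties used (`jobTitle`, `affiliation`, `knowsAbout`) are real, but which ones the actual pipeline emits is an assumption.

```python
import json


def to_schema_person(scraped):
    """Map a scraped profile (hypothetical field names) to a schema.org Person."""
    person = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": scraped["name"],
        "url": scraped["profile_url"],
    }
    if scraped.get("title"):
        person["jobTitle"] = scraped["title"]
    if scraped.get("organization"):
        # Name-only value: no "@id" yet, because the organization has not
        # been linked to an entity in the graph at harvest time.
        person["affiliation"] = {
            "@type": "Organization",
            "name": scraped["organization"],
        }
    if scraped.get("expertise"):
        person["knowsAbout"] = list(scraped["expertise"])
    return person


# Entirely made-up example record.
example = {
    "name": "Jane Example",
    "profile_url": "https://example.org/staff-profiles/jane-example",
    "title": "Research Geologist",
    "organization": "Example Science Center",
    "expertise": ["geochemistry", "critical minerals"],
}
print(json.dumps(to_schema_person(example), indent=2))
```

The later linking step would then replace or supplement each name-only value with an identifier for the matching item once it has been resolved.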

Periodic Updates

The caching process is managed through the statements for a person recording their USGS profile URL (P145 "official website") and ORCID (P106). To aid in managing automated caching tasks, we record several pieces of information in qualifiers on the official website claims:

  • last update (P129) date indicating when we last ran an HTTP operation on the URL
  • status code (P151) indicating if the resource was available or not
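  • retrieved (P139) date specifying when we last retrieved information from the foreign resource

These qualifiers can be queried with SPARQL along the lines of the following, which lists each person's staff profile URL with its caching metadata.
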
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?item ?url ?last_update ?retrieved ?status_code 
WHERE {
  ?item wdt:P1 wd:Q3 ;
        wdt:P145 ?url ;
        p:P145 ?url_statement .
  OPTIONAL {
    ?url_statement pq:P129 ?last_update .
  }
  OPTIONAL {
    ?url_statement pq:P139 ?retrieved .
  }
  OPTIONAL {
    ?url_statement pq:P151 ?status_code .
  }
  FILTER(regex(str(?url), "staff-profiles"))
}
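
For automated caching tasks, the same query can be run from a script. The sketch below is one way to do that with the Python standard library; the endpoint URL is an assumption, and the helper names are made up for illustration. The response is parsed according to the standard SPARQL JSON results format, in which unbound OPTIONAL variables are simply absent from a binding.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed query service endpoint; not given in the text.
SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"


def run_query(sparql):
    """Send a SPARQL query and return the parsed JSON response."""
    url = f"{SPARQL_ENDPOINT}?{urlencode({'query': sparql})}"
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request) as response:
        return json.load(response)


def flatten_bindings(results):
    """Turn SPARQL JSON results into plain dicts.

    Variables left unbound by an OPTIONAL block become None.
    """
    variables = results["head"]["vars"]
    return [
        {var: binding.get(var, {}).get("value") for var in variables}
        for binding in results["results"]["bindings"]
    ]


def never_checked(rows):
    """Rows whose official website claim has no status code qualifier yet."""
    return [row for row in rows if row["status_code"] is None]
```

Calling `never_checked(flatten_bindings(run_query(query_text)))` would give the profile URLs that have not yet been through an HTTP check, which is one plausible starting point for a caching run.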


Previous Address

One of the challenges in working with the staff profiles content comes from how the URL space is managed. URLs for the same person change and appear over time in inconsistent ways. Sometimes an older URL gets a 301 redirect to a new URL; other times, both the old and new URLs still return 200 responses. In other cases, old URLs disappear entirely (404 responses) or return a 403 "forbidden" response, indicating that something is probably still there but not accessible anonymously.

We have had to deal with this issue in a variety of ways, not all of which can be automated (yet). After initially recording multiple official website claims, we remodeled the content to record a single official website for a person, with a "previous address" qualifier recording additional URLs that may or may not still be valid for that person. When a URL stops returning a functional response (200 or 301), we need to investigate at some point. The retrieved date then indicates when a person may have separated from the USGS, which we can check against internal information.
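
The triage described above could be sketched roughly as follows. The category names and the record layout are invented for illustration; only the status codes and the notion of a "functional" response (200 or 301) come from the text.

```python
# Status codes treated as a functional response, per the description above.
FUNCTIONAL = {200, 301}


def classify_status(status_code):
    """Group an HTTP status code into the cases discussed in the text."""
    code = int(status_code)  # qualifier values may arrive as strings
    if code in FUNCTIONAL:
        return "functional"
    if code == 403:
        return "forbidden"   # probably still there, but not accessible anonymously
    if code == 404:
        return "missing"     # URL has disappeared
    return "other"


def urls_to_investigate(records):
    """Return URLs whose last recorded status was not functional."""
    return [
        record["url"]
        for record in records
        if classify_status(record["status_code"]) != "functional"
    ]


# Hypothetical records, as might be derived from official website claims.
records = [
    {"url": "https://example.org/staff-profiles/a", "status_code": 200},
    {"url": "https://example.org/staff-profiles/b", "status_code": "404"},
    {"url": "https://example.org/staff-profiles/c", "status_code": 403},
]
print(urls_to_investigate(records))
```

A list like this only flags candidates; deciding whether a URL belongs in a "previous address" qualifier still takes the manual investigation described above.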

Analytical Uses

Bringing our many disparate (meta)data sources together in graph form supports network analysis in various ways. One area of interest is our human networks of people working across organizational boundaries. We can see these by examining where people intersect in co-authoring papers and working on projects together, and potentially where we are working on some of the same things but perhaps not collaborating as much as we could be.

Queries

The following is a basic query pattern that can be used to find cases where staff from different Science Centers have collaborated on products. The query generates a graph with labels for the intersection of people co-authoring creative works within a recent time range. It uses query criteria and filters to identify cases where people are working together across organizational units. This particular case examines co-authors who are affiliated with one of the organizational units in the Midcontinent Region.

PREFIX ge: <https://geokb.wikibase.cloud/entity/>
PREFIX gp: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?author1 ?organization1 ?article ?author2 ?organization2
?author1Label ?organization1Label ?articleLabel ?author2Label ?organization2Label
WHERE {
  ?article gp:P102 ?author1, ?author2 . # Get articles authored by multiple people
  ?author1 gp:P108 ?organization1 . # Get author 1's organizational affiliation
  ?author2 gp:P108 ?organization2 . # Get author 2's organizational affiliation
  ?organization1 gp:P190* ge:Q44363 . # Restrict author 1's organization to subsidiaries of the Midcontinent Region
  ?organization2 gp:P190* ge:Q44363 . # Restrict author 2's organization to subsidiaries of the Midcontinent Region
  ?article gp:P7 ?publication_date . # Get the publication date for the articles
  FILTER (?author1 != ?author2 && ?organization1 != ?organization2 && YEAR(?publication_date) > 2017) # Keep distinct authors in different organizations on works published after 2017 (2018 onward)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
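
Because `?author1` and `?author2` are interchangeable in this pattern, each collaborating pair comes back twice, once in each order. The sketch below shows one way to collapse the results into an organization-level edge list for network analysis. It assumes the rows have already been flattened into dicts keyed by the query's variable names; the function name is made up for illustration.

```python
from collections import Counter


def organization_edges(rows):
    """Count distinct co-authored articles per unordered organization pair."""
    seen = set()
    for row in rows:
        # Sort so that (A, B) and (B, A) are treated as the same edge.
        pair = tuple(sorted((row["organization1"], row["organization2"])))
        # Track (pair, article) so each article counts once per pair,
        # however many author combinations connect the two organizations.
        seen.add((pair, row["article"]))
    return Counter(pair for pair, _article in seen)


# Hypothetical rows using placeholder identifiers.
rows = [
    {"article": "A1", "organization1": "O1", "organization2": "O2"},
    {"article": "A1", "organization1": "O2", "organization2": "O1"},
    {"article": "A2", "organization1": "O1", "organization2": "O2"},
    {"article": "A2", "organization1": "O1", "organization2": "O3"},
]
for pair, weight in sorted(organization_edges(rows).items()):
    print(pair, weight)
```

The resulting weights can be loaded into any graph tool as edge weights between organizational units.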


An important note here is that the affiliation information mostly reflects the latest affiliation we know about for a person from a personnel organization standpoint. The linkage comes from the organization(s) stated on a person's staff profile, obtained by scraping. It is not necessarily fully current, it does not reflect the full history of a person's affiliations, and it is not based on the affiliation authors stated at the time of publishing an article or report.

The knowledgebase model supports recording many affiliations for a given person, with the potential for robust time bounds if that information could be surfaced. But we have challenges in information quality and completeness that have not been overcome. Affiliations for authors are sometimes recorded in the Publications Warehouse, but these are essentially name-only identifiers for organizational units, not all of which have been disambiguated and tied to a usable identifier. Ideally, for an analysis like this, we would look specifically at the affiliation of each co-author at the time of co-authoring a publication, but we need more sophistication in our underlying characterization to do so comprehensively.
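
To make the idea concrete, here is a sketch of what a time-bounded lookup could look like if affiliations carried start and end dates. The record layout is purely hypothetical; as noted above, the data to support this is not yet available comprehensively.

```python
from datetime import date


def affiliations_on(affiliations, when):
    """Return organizations whose time bounds contain the given date.

    A start or end of None means that bound is unknown or open-ended.
    """
    return [
        a["organization"]
        for a in affiliations
        if (a["start"] is None or a["start"] <= when)
        and (a["end"] is None or when <= a["end"])
    ]


# Hypothetical affiliation history for one person.
history = [
    {"organization": "Center A", "start": date(2010, 1, 1), "end": date(2015, 12, 31)},
    {"organization": "Center B", "start": date(2016, 1, 1), "end": None},
]
print(affiliations_on(history, date(2014, 6, 1)))
```

With affiliations modeled this way, the co-authorship analysis could compare each author's affiliation on the publication date rather than their latest known one.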