
Item talk:Q3: Difference between revisions

From geokb
1,666 bytes added, 6 months ago
Line 3: Line 3:
= Caching raw data =
= Caching raw data =
The information we are using to build representations of people comes from a couple of different online sources.
The information we are using to build representations of people comes from a couple of different online sources.
* USGS Personnel Profile pages (via a web scraping routine)
* USGS Staff Profile pages (via a web scraping routine)
* ORCID records
* ORCID records


Line 9: Line 9:


Storing raw, yet-to-be-processed content in wiki pages has the added benefit of immediately adding content to the overall search index in the Wikibase instance while we hash through additional reference data mapping. This can be accessed directly using the search form or Mediawiki API. While this is not as specific and transitive across the entire knowledgebase as a SPARQL approach, it does open up possibilities for additional use patterns to leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.
Storing raw, yet-to-be-processed content in wiki pages has the added benefit of immediately adding content to the overall search index in the Wikibase instance while we hash through additional reference data mapping. This can be accessed directly using the search form or Mediawiki API. While this is not as specific and transitive across the entire knowledgebase as a SPARQL approach, it does open up possibilities for additional use patterns to leverage larger chunks of text or data structures not yet fully digested into the knowledge representation.
For USGS Staff Profile pages, we transform scraped content into the schema.org standard for [https://schema.org/Person Person] along with other applicable types. This is a "hoped for" feature that we'd like to see develop eventually - JSON-LD exposed through personal landing pages. The content is not linked to anything at the time of harvest, so we embed name-only values into the JSON-LD rendering (stored as YAML on item talk pages). We then run a secondary process to develop linkages to known entities in the graph when we process claims.
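As a rough illustration of the name-only embedding described above, the following sketch builds a schema.org Person document in which related organizations carry only a name and no identifier. The function name <code>person_jsonld</code>, its parameters, and the sample values are hypothetical and not taken from the actual scraping routine; <code>name</code>, <code>jobTitle</code>, and <code>affiliation</code> are standard schema.org Person properties. The sketch prints JSON; serializing to YAML for the item talk page would need a YAML library and is not shown.

```python
import json


def person_jsonld(name, job_title=None, affiliation_names=()):
    """Build a schema.org Person document with name-only references.

    Hypothetical helper for illustration; not the actual harvest code.
    """
    doc = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
    }
    if job_title:
        doc["jobTitle"] = job_title
    if affiliation_names:
        # Organizations are embedded by name only (no "@id"), because
        # nothing is linked to graph entities at harvest time.
        doc["affiliation"] = [
            {"@type": "Organization", "name": org_name}
            for org_name in affiliation_names
        ]
    return doc


# Hypothetical sample values, not real profile data.
example = person_jsonld(
    "Jane Doe", "Research Geologist", ["Example Science Center"]
)
print(json.dumps(example, indent=2))
```

A later linking step could then match each embedded <code>name</code> against known entities and add identifiers when claims are processed.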


== Periodic Updates ==
== Periodic Updates ==
The caching process is managed through the statements for a person recording their USGS profile URL (P31 "reference URL") and ORCID (P106). To aid in managing automated caching tasks, we record a retrieved (P139) date specifying when we last retrieved information from the foreign resource and a status code (P151) indicating if the resource was available or not. This information is also in the "meta" section of the cached YAML structure, but we place it into the source claims so that we can query on it using something like the following SPARQL.
The caching process is managed through the statements for a person recording their USGS profile URL (P145 "official website") and ORCID (P106). To aid in managing automated caching tasks, we record several pieces of information in qualifiers on the official website claims:
* last update (P129) date indicating when we last ran an HTTP operation on the URL
* status code (P151) indicating if the resource was available or not
* retrieved (P139) date specifying when we last retrieved information from the foreign resource


<sparql tryit="1">
<sparql tryit="1">
Line 17: Line 22:
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX ps: <https://geokb.wikibase.cloud/prop/statement/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>


SELECT ?item ?itemLabel ?profile_url ?retrieved ?status_code
SELECT ?item ?url ?last_update ?retrieved ?status_code  
WHERE {
WHERE {
   ?item wdt:P1 wd:Q3 ;
   ?item wdt:P1 wd:Q3 ;
         wdt:P31 ?profile_url ;
         wdt:P145 ?url ;
         p:P31 ?ref_url_statement .
         p:P145 ?url_statement .
  OPTIONAL {
    ?url_statement pq:P129 ?last_update .
  }
   OPTIONAL {
   OPTIONAL {
     ?ref_url_statement pq:P151 ?status_code ;
     ?url_statement pq:P139 ?retrieved .
                      pq:P139 ?retrieved .
   }
   }
   SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
   OPTIONAL {
    ?url_statement pq:P151 ?status_code .
  }
  FILTER(regex(str(?url), "staff-profiles"))
}
}
</sparql>
</sparql>
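One possible way to work with the output of a query like this is sketched below. It assumes the response has already been fetched in the standard SPARQL 1.1 JSON results format and only shows how rows with missing OPTIONAL qualifiers might be handled. The function names and the sample response are hypothetical, and comparing the status code against the string "200" is an assumption about how the value is stored.

```python
def flatten_bindings(results):
    """Turn a SPARQL JSON result set into a list of plain dicts.

    Variables left unbound by an OPTIONAL clause are reported as None.
    """
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        row = {
            var: binding[var]["value"] if var in binding else None
            for var in variables
        }
        rows.append(row)
    return rows


def rows_needing_attention(rows):
    """Select rows with no recorded status code or a status other than 200."""
    # Assumption: the status code qualifier renders as the string "200".
    return [row for row in rows if row.get("status_code") != "200"]


# Hypothetical response with made-up items and URLs, for illustration only.
sample = {
    "head": {"vars": ["item", "url", "last_update", "retrieved", "status_code"]},
    "results": {
        "bindings": [
            {
                "item": {"type": "uri", "value": "https://example.org/entity/Q1"},
                "url": {"type": "uri", "value": "https://example.org/staff-profiles/a"},
                "status_code": {"type": "literal", "value": "200"},
            },
            {
                "item": {"type": "uri", "value": "https://example.org/entity/Q2"},
                "url": {"type": "uri", "value": "https://example.org/staff-profiles/b"},
                "status_code": {"type": "literal", "value": "404"},
            },
            {
                "item": {"type": "uri", "value": "https://example.org/entity/Q3"},
                "url": {"type": "uri", "value": "https://example.org/staff-profiles/c"},
            },
        ]
    },
}

rows = flatten_bindings(sample)
flagged = rows_needing_attention(rows)
print([row["item"] for row in flagged])
```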
== Previous Address ==
One of the challenges in working with the staff profiles content comes from how the URL space is being managed. URLs for the same person change and show up over time in inconsistent ways. Sometimes an older URL will get a 301 redirect to a new URL, but other times, both an old and new URL will still return 200 responses. In other cases, old URLs disappear entirely (404 responses), and in other cases these return a 403 "forbidden" response, indicating that something is probably still there but inaccessible anonymously.
We have had to deal with this issue in a variety of ways, not all of which can be automated (yet). After initially recording multiple official website claims, we ended up remodeling the content to record only a single official website for a person and then using a "previous address" qualifier to record additional URLs that may or may not be valid for a person. When a URL stops returning a functional response (200 or 301), we end up needing to investigate at some point. The retrieved date is an indicator of when a person may have separated from the USGS for some reason, and then we have internal information about this that can be consulted.
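The response handling described above could be sketched roughly as follows. This is a minimal illustration rather than the code actually in use: the <code>UrlState</code> names and both functions are hypothetical, and the only rules encoded are the ones stated in the text (200 and 301 count as functional, 404 as gone, 403 as present but inaccessible anonymously).

```python
from enum import Enum


class UrlState(Enum):
    """Hypothetical labels for the response patterns described in the text."""
    OK = "ok"                  # 200: page is available
    MOVED = "moved"            # 301: redirected to a new URL
    GONE = "gone"              # 404: old URL has disappeared
    RESTRICTED = "restricted"  # 403: probably still there, not anonymously
    UNKNOWN = "unknown"        # anything else


def classify_status(code):
    """Map an HTTP status code to one of the states above."""
    mapping = {
        200: UrlState.OK,
        301: UrlState.MOVED,
        403: UrlState.RESTRICTED,
        404: UrlState.GONE,
    }
    return mapping.get(code, UrlState.UNKNOWN)


def needs_investigation(code):
    """A URL needs manual follow-up when it stops returning 200 or 301."""
    return classify_status(code) not in (UrlState.OK, UrlState.MOVED)


for status in (200, 301, 403, 404, 500):
    print(status, classify_status(status).value, needs_investigation(status))
```

Note that a 200 response alone does not settle which URL is current, since an old and a new URL may both return 200; that case still calls for the "previous address" bookkeeping described above.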


= Analytical Uses =
= Analytical Uses =