Item talk:Q10


The National Instrument 43-101 Technical Reports are a special type of document required by the Canadian Securities Administrators for mining companies they regulate. A similar type of report called S-K 1300 is required by the U.S. Securities and Exchange Commission. Both sets of documents are being represented in the GeoKB as important scientific reference material, providing data and information on mining production history and geoscientific context (mineral deposit types, etc.). The GeoKB representation serves as a linking resource to data and information identified in or extracted from the reports through various types of processes.

Source Material

The digital source material for the reports is maintained in a restricted ScienceBase collection. Bibliographic metadata are organized into a Zotero group library that serves publishing scientists as a reference library resource and provides public "landing pages" that are linked to from citations. Permanent links to both metadata and file content are facilitated through w3id.org rewrite configurations that are recorded with GeoKB entities.

Source Processing

The process of developing the GeoKB representation for document items continues to evolve. The intent is to leverage the GeoKB knowledge graph capability to organize the results of AI-assisted linked open data development, taking graph fragments generated by Retrieval Augmented Generation (RAG) pipelines and recording them as part of the overall knowledge graph for query and analysis.

We are also developing a schema.org representation in JSON-LD as the base metadata store for the reports. These will be stored as files in the source repository in ScienceBase and will be recorded with the GeoKB items as YAML on item talk pages, replacing a previous experiment with the Zotero JSON structure for both metadata and attachments. In this way, any content from full metadata records not yet organized into the knowledge graph will be accessible via full text search in the Wikibase instance.
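
As a rough illustration of this flow, the following sketch assembles a minimal, hypothetical schema.org record as a Python dictionary and serializes it to both JSON-LD (for the ScienceBase file) and YAML (for an item talk page). The field values are placeholders and the property selection is an assumption, not the finalized metadata schema (assumes the PyYAML package):

import json
import yaml  # PyYAML

# Minimal, hypothetical schema.org record for an NI 43-101 report;
# the values are placeholders, not an actual GeoKB metadata record.
report_metadata = {
    "@context": "https://schema.org",
    "@type": "Report",
    "name": "Example NI 43-101 Technical Report",
    "about": ["example mining property", "example mineral commodity"],
    "encoding": {
        "@type": "MediaObject",
        "encodingFormat": "application/pdf",
    },
}

# JSON-LD text destined for the source repository in ScienceBase
jsonld_text = json.dumps(report_metadata, indent=2)

# YAML rendering of the same record, as it might appear on an item talk page
yaml_text = yaml.safe_dump(report_metadata, sort_keys=False)

print(jsonld_text)
print(yaml_text)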

Schema

We currently treat the ScienceBase collection as the master source for (meta)data. We are evolving our use of the schema.org specifications as a standardized way of encoding metadata details. Schema.org is natively linked open data, fundamentally in knowledge graph form, linking the document by identifier to other entities (authors, organizations, subject matter addressed). Some of these entities are more important than others in terms of the GeoKB representation. For instance, we "care" about authorship from the standpoint of bibliographic metadata for citations, but we do not need the authors of these reports instantiated as entities in the graph. We care more about subject matter such as mining property names, locations, and mineral commodities produced, from the standpoint of representation and linking in the knowledge graph.

The following are specific notes on the mapping from schema.org source to GeoKB properties and pools of entities:

label, description, and aliases

Labels for NI 43-101 report items are a work in progress, with the intent to ultimately use the official title of the report extracted through document processing. The initial set of labels was generated using a few details from the file naming scheme employed when the files were managed on local/network file stores. The descriptions will become increasingly important as additional context in the larger knowledge graph, viewed alongside labels that may not always indicate that an item is one of these reports. We will follow a standard convention that may move some of the basic descriptive details into descriptions. Aliases are currently unused but could contain other details, such as a mine/project name, that might be useful from a search/discovery standpoint.

instance of (P1) and classification

All report items are declared as an instance of this item (Q10). The following query will retrieve the full classification of these items:

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?subclassOf ?subclassOfLabel
WHERE {
  wd:Q10 wdt:P2* ?item .
  ?item wdt:P2 ?subclassOf .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

metadata URL (P141)

This is perhaps the most crucial identifier to use in any external reference or processing with these records. These are the "evergreen URLs" assigned to the items using the W3ID.org system. W3ID uses a simple rewrite mechanism, and we established a namespace rule that lets these URLs operate somewhat like a DOI in practical terms. The URL path includes a combination of the Zotero Group Library identifier and the metadata item identifier. If we move to different technology in the future, we will retain these identifiers and continue to resolve them to whatever new access point we have in place.

Included with these claims are mime type qualifiers indicating that the URL can be used to send a user to what amounts to a landing page view within the Zotero user interface. The URLs will also respond to content negotiation, returning the raw JSON content of the full metadata for the item when application/json is supplied as the Accept header.
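
A minimal sketch of both access modes with the Python requests library follows; the metadata URL shown is a placeholder for illustration, not an actual record:

import requests

# Placeholder metadata URL (P141); substitute a real w3id.org value from a report item.
meta_url = "https://w3id.org/example-namespace/LIBRARYID/ITEMKEY"

# Default behavior: follow the rewrite to the landing page view in the Zotero interface.
landing = requests.get(meta_url, allow_redirects=True, timeout=30)
print(landing.url)  # resolved landing page location

# Content negotiation: ask for the raw JSON metadata for the item.
metadata = requests.get(
    meta_url,
    headers={"Accept": "application/json"},
    allow_redirects=True,
    timeout=30,
)
print(metadata.json())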

data archived at (P200)

This statement contains a w3id.org form of the identifier for the ScienceBase item that serves as the logical repository for one or more PDF files associated with a given report, the schema.org metadata (in JSON-LD) associated with the report, and any derivatives created through processing of the original content.

GDDID (P93)

This is the unique identifier for the document (article) as processed through and represented in the xDD (GeoDeepDive) cyberinfrastructure. This includes indexing of the full content, identification of key facets (concepts) in the documents, document segmentation through the COSMOS engine, and value-added processing pipelines. These can be used to interface with xDD APIs and other aspects of the infrastructure as needed. Not all NI 43-101 items will have these identifiers available.
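
As a hedged sketch, assuming the xDD articles endpoint at https://xdd.wisc.edu/api/articles accepts a docid parameter (an assumption about the current API, not confirmed here), a GDDID from the GeoKB could be used to look up the corresponding xDD record:

import requests

# Hypothetical GDDID (P93) value; substitute one retrieved from the GeoKB.
gddid = "EXAMPLE_GDDID"

# Assumption: the xDD articles API accepts a docid query parameter and returns JSON.
response = requests.get(
    "https://xdd.wisc.edu/api/articles",
    params={"docid": gddid},
    timeout=30,
)
response.raise_for_status()
print(response.json())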

Example Query

The following query can be modified as needed to pull records for external processing or other uses. If you don't actually need the title, remove the ?reportLabel variable and the SERVICE statement, as this will improve query performance. There should not be a problem eliminating the LIMIT and pulling all records, but you may also iterate using LIMIT and OFFSET (standard SPARQL methods); a paging sketch follows the query below.

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?report ?reportLabel ?meta_url ?content_url ?attachment_key ?mime_type ?checksum ?gddid
WHERE {
  ?report wdt:P1 wd:Q10 ; # instance of NI 43-101 Technical Report
          wdt:P141 ?meta_url ; # permanent URL to the online representation
          wdt:P136 ?content_url ; # read URL to attachment content
          wdt:P143 ?attachment_key ; # attachment key used to download PDF file content
          p:P136 ?content_url_statement .
  ?content_url_statement pq:P65 ?mime_type ; # mime type of attachment
                         pq:P197 ?checksum . # MD5 checksum for the attachment
  OPTIONAL {
    ?report wdt:P93 ?gddid . # unique ID for the xDD cyberinfrastructure
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
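
For programmatic use, the sketch below pages through results with LIMIT and OFFSET, assuming the usual wikibase.cloud SPARQL endpoint location (https://geokb.wikibase.cloud/query/sparql, an assumption) and the Python requests library; the query is trimmed to a couple of fields for brevity:

import requests

# Assumption: the GeoKB SPARQL endpoint follows the standard wikibase.cloud pattern.
ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

# Trimmed, paged version of the report query; {limit} and {offset} are filled per page.
QUERY_TEMPLATE = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?report ?meta_url
WHERE {{
  ?report wdt:P1 wd:Q10 ;      # instance of NI 43-101 Technical Report
          wdt:P141 ?meta_url . # permanent URL to the online representation
}}
LIMIT {limit}
OFFSET {offset}
"""

def iter_reports(page_size=500):
    """Yield result bindings one page at a time until the endpoint returns no rows."""
    offset = 0
    while True:
        response = requests.get(
            ENDPOINT,
            params={
                "query": QUERY_TEMPLATE.format(limit=page_size, offset=offset),
                "format": "json",
            },
            timeout=60,
        )
        response.raise_for_status()
        rows = response.json()["results"]["bindings"]
        if not rows:
            break
        yield from rows
        offset += page_size

for row in iter_reports():
    print(row["report"]["value"], row["meta_url"]["value"])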

Note on file naming/identifiers

If an operator wants to take these documents into another environment for processing through some form of segmentation, vectorization, and other AI methods, it is important to retain a connection to an identifier that will let us connect extracted/derived information and data back to our core records. The metadata URL supplied will do that, with the attachment key as an optional element that can also be important for tracking back to the specific PDF file used. One approach would be to pull all three keys for use in file naming or in whatever file system catalog might be used:

E.g., <library ID>_<item key>_<attachment key>.pdf

Alternatively, treat the metadata URL like a DOI on other types of publications, and incorporate it into processing such that any derivatives built from these records retain the connection back to the source.
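
As an illustration of the first approach, the sketch below composes a file name in the pattern shown above from hypothetical values; the assumption that the library and item identifiers sit as the last two segments of the metadata URL path is for illustration only:

from urllib.parse import urlparse

# Hypothetical values standing in for a real metadata URL (P141) and attachment key (P143).
meta_url = "https://w3id.org/example-namespace/LIBRARYID/ITEMKEY"
attachment_key = "ATTACHMENTKEY"

# Assumption for illustration: the last two path segments are the library ID and item key.
library_id, item_key = urlparse(meta_url).path.strip("/").split("/")[-2:]

# Compose a file name that keeps the link back to the source record and the specific PDF.
file_name = f"{library_id}_{item_key}_{attachment_key}.pdf"
print(file_name)  # LIBRARYID_ITEMKEY_ATTACHMENTKEY.pdf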