Item talk:Q10

From geokb
Revision as of 23:29, 29 August 2023 by Sky (talk | contribs) (→‎Schema)

Overview

The National Instrument 43-101 Technical Reports are a special type of document required by the Canadian Securities Administrators for the mining companies they regulate. A similar type of report, called S-K 1300, is required by the U.S. Securities and Exchange Commission. S-K 1300 reports will also be brought into the GeoKB at a future date.

These reports contain information useful for conducting mineral resource assessments at the USGS, such as the prospecting history for a mining operation. We represent them here in the GeoKB to serve as a linking resource to data and information identified in or extracted from the reports through various types of processes. The USGS maintains a collection of metadata and PDF file content for NI 43-101 reports, accessed over time, in a Zotero Group Library. The metadata are available without authentication, but the report contents require authentication and authorization because of copyright notices included in some reports. Having the metadata online and linkable gives us a vital point of citation reference, as these reports are not otherwise accessible.

Schema

We currently treat the Zotero Group Library as the master source for these records, knowing that we may one day move to some other technology. We use code to pull the parts of the metadata we need from Zotero for the knowledge graph representation, with a few transformations to add functionality not available in Zotero. The following notes explain what the core statements/properties for these records mean and how they can be used:

metadata URL (P141)

This is perhaps the most crucial identifier to use in any external reference or processing with these records. These are the "evergreen URLs" assigned to the items using the W3ID.org system. W3ID uses a simple rewrite mechanism, and we established a namespace rule that lets these URLs operate somewhat like a DOI in practical terms. The URL path combines the Zotero Group Library identifier and the metadata item identifier. If we move to a different technology in the future, we will retain these identifiers and continue to resolve them to whatever new access point we have in place.

Included with these claims are MIME type qualifiers indicating that the URL can be used to send a user to what amounts to a landing page view within the Zotero user interface. The URLs also respond to content negotiation: a request with an Accept header of application/json returns the raw JSON content of the full metadata for the item.
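
The content negotiation described above can be sketched in Python with the standard library. This is a minimal illustration rather than the project's own client: the function names are invented here, and any URL passed in should be a real value taken from a metadata URL (P141) claim.

```python
import json
import urllib.request


def build_metadata_request(meta_url):
    """Build a request that asks the metadata URL for JSON rather than the landing page."""
    return urllib.request.Request(meta_url, headers={"Accept": "application/json"})


def fetch_metadata(meta_url, timeout=30):
    """Send the request (urlopen follows any redirects) and parse the JSON body.

    Requires network access; the structure of the returned JSON is not assumed here.
    """
    request = build_metadata_request(meta_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_metadata(meta_url) with a P141 value should return the parsed metadata; without the Accept header, the same URL leads to the landing page view instead.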

content URL (P136)

This statement contains a generated URL made up of the Group Library ID, the item key, and the attachment key. It links to a browser-based PDF viewer that is only available to a user who has already authenticated in a browser session (it will not prompt for authentication). It is included as a convenience to let users move between working with the GeoKB interface and viewing a document.

Zotero Attachment Key (P143)

This is the unique key for the file content associated with a metadata item. In the case of this particular collection, there is only a single PDF file attachment for each report. Other collections may have more than one or different types of attachments.

One use of the attachment key is with the pyzotero package, where it can be used to dump a file to disk.

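from pyzotero import zotero

# library_id, api_key, and attachment_key are placeholders; supply your own values
zot = zotero.Zotero(library_id, 'group', api_key)

# Write the PDF attachment to the current directory
zot.dump(
    attachment_key,
    filename='some_file_name.pdf',
    path='./'
)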

The "zot" object here represents an authenticated and authorized Zotero API connection using pyzotero where an API key provided by the USGS or created by a user with read authorization is used to establish a connection to the library.

These claims include the file size of the attachment, in bytes, as a "data size" qualifier.
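
One possible use of the "data size" qualifier is a quick integrity check after a download. The following is a minimal sketch under that assumption; the helper name size_matches is hypothetical and not part of pyzotero or the GeoKB code.

```python
import os


def size_matches(pdf_path, expected_bytes):
    """Return True when the file on disk has exactly the expected byte count.

    expected_bytes may be an int or a numeric string, since SPARQL results
    typically arrive as strings.
    """
    return os.path.getsize(pdf_path) == int(expected_bytes)
```

A mismatch would suggest a truncated download, or an attachment that changed after the GeoKB record was last synchronized.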

Zotero Version Number (P142)

This is a specialized integer value indicating the version of the metadata used to create or update the entity in the GeoKB. It is used internally for synchronization purposes with the base store in Zotero.
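
The synchronization logic itself is not described here, so the following is only an illustrative guess at how version numbers could drive it: compare the version stored in the GeoKB for each item key with the current version in Zotero, and refresh anything that is new or newer. The function name and the dictionary inputs are assumptions made for this sketch.

```python
def items_needing_sync(geokb_versions, zotero_versions):
    """Return the item keys that are new in Zotero or have a higher version there.

    Both arguments map a Zotero item key to an integer version number.
    """
    return sorted(
        key
        for key, version in zotero_versions.items()
        if geokb_versions.get(key, -1) < version
    )
```

Items that exist only in the GeoKB are ignored in this sketch; handling deletions would need a separate rule.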

GDDID (P93)

This is the unique identifier for the document (article) as processed through and represented in the xDD (GeoDeepDive) cyberinfrastructure. This includes indexing of the full content, identification of key facets (concepts) in the documents, document segmentation through the COSMOS engine, and value-added processing pipelines. These can be used to interface with xDD APIs and other aspects of the infrastructure as needed. Not all NI 43-101 items will have these identifiers available.

Example Query

The following query can be modified as needed to pull records for external processing or other uses. If you don't actually need the title, remove the ?reportLabel field and the SERVICE statement as this will improve query performance. There should not be a problem eliminating the LIMIT and pulling all records, but you may also iterate using LIMIT and OFFSET (standard SPARQL methods).

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?report ?reportLabel ?meta_url ?content_url ?attachment_key ?file_size
WHERE {
  ?report wdt:P1 wd:Q10 ; # instance of NI 43-101 Technical Report
          wdt:P141 ?meta_url ; # permanent URL to the online representation
          wdt:P136 ?content_url ; # read URL to attachment content
          wdt:P143 ?attachment_key ; # attachment key used to download PDF file content
          p:P143 ?attachment_key_statement .
  ?attachment_key_statement pq:P144 ?file_size . # attachment file size
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
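
Iterating with LIMIT and OFFSET could look something like the following Python sketch. The endpoint URL is an assumption based on the usual wikibase.cloud layout and should be verified before use; the function names and the shortened query are also just for illustration. An ORDER BY clause is added because paging without a stable order may repeat or skip rows.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint, following the usual wikibase.cloud layout; verify before use.
SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

# Shortened version of the example query, without labels or qualifiers.
BASE_QUERY = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?report ?meta_url ?attachment_key
WHERE {
  ?report wdt:P1 wd:Q10 ;
          wdt:P141 ?meta_url ;
          wdt:P143 ?attachment_key .
}
ORDER BY ?report
"""


def paged_query(limit, offset):
    """Append LIMIT and OFFSET to the base query to select one page of results."""
    return f"{BASE_QUERY.strip()}\nLIMIT {int(limit)} OFFSET {int(offset)}"


def fetch_all(page_size=100):
    """Yield result bindings page by page until a short page arrives (needs network)."""
    offset = 0
    while True:
        params = urllib.parse.urlencode({"query": paged_query(page_size, offset)})
        request = urllib.request.Request(
            f"{SPARQL_ENDPOINT}?{params}",
            headers={"Accept": "application/sparql-results+json"},
        )
        with urllib.request.urlopen(request, timeout=60) as response:
            rows = json.load(response)["results"]["bindings"]
        yield from rows
        if len(rows) < page_size:
            break
        offset += page_size
```

Each yielded row is a dictionary keyed by variable name, so row["meta_url"]["value"] gives the metadata URL for a report.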

Note on file naming/identifiers

If an operator wants to take these documents into another environment for processing through some form of segmentation, vectorization, or other AI methods, it is important to retain an identifier that lets us connect extracted or derived information and data back to our core records. The metadata URL does that on its own; the attachment key is an optional addition that can help track the specific PDF file used. One approach is to use all three keys (library ID, item key, and attachment key) in file naming or in whatever file system catalog is used:

E.g., <library ID>_<item key>_<attachment key>.pdf
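
A small sketch of this naming convention in Python, including the reverse step of recovering the keys from a file name. The function names are hypothetical, and the parsing assumes that none of the three identifiers contains an underscore.

```python
import os


def build_filename(library_id, item_key, attachment_key):
    """Compose <library ID>_<item key>_<attachment key>.pdf."""
    return f"{library_id}_{item_key}_{attachment_key}.pdf"


def parse_filename(filename):
    """Recover (library_id, item_key, attachment_key) from a name built above.

    Raises ValueError if the name does not split into exactly three parts.
    """
    stem, _extension = os.path.splitext(os.path.basename(filename))
    library_id, item_key, attachment_key = stem.split("_")
    return library_id, item_key, attachment_key
```

Derived outputs (text chunks, embeddings, extracted tables) can carry the same stem, so that any result can be traced back to its GeoKB record and the exact PDF used.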

Alternatively, treat the metadata URL like a DOI identifier on other types of publications, and incorporate that into processing such that any derivatives built from these records retain the connection to source.