Item talk:Q10: Difference between revisions

From geokb
 
(6 intermediate revisions by the same user not shown)
Line 1: Line 1:
= Overview =
The National Instrument 43-101 Technical Reports are a special type of document required by the Canadian Securities Administrators for mining companies they regulate. A similar type of report called S-K 1300 is required by the U.S. Securities and Exchange Commission. Both sets of documents are being represented in the GeoKB as important scientific reference material, providing data and information on mining production history and geoscientific context (mineral deposit types, etc.). The GeoKB representation serves as a linking resource to data and information identified in or extracted from the reports through various types of processes.
The National Instrument 43-101 Technical Reports are a special type of document required by the Canadian Securities Administrators for mining companies they regulate. A similar type of report called S-K 1300 is required by the U.S. Securities and Exchange Commission. These will also be brought into the GeoKB at a future date.


These reports contain useful information in conducting mineral resource assessments in the USGS such as prospecting history for a mining operation. We represent them here in the GeoKB in order to serve as a linking resource to data and information identified in or extracted from the reports through various types of processes. The USGS maintains a collection of metadata and PDF file content for NI 43-101 reports accessed over time in a Zotero Group Library where the metadata are available without authentication, but the report contents require authentication and authorization based on copyright notices included in some reports. Having the metadata online and linkable provides us with a vital point of citation reference as these reports are not otherwise accessible.
= Source Material =
The digital source material for the reports is maintained in a [https://www.sciencebase.gov/catalog/item/6618596fd34e7eb9eb7d7b7c restricted ScienceBase collection]. Bibliographic metadata are organized into a [https://www.zotero.org/groups/4530692/usgs_ni_43-101_reports/library Zotero group library] that serves publishing scientists as a reference library resource and provides public "landing pages" that are linked to from citations. Permanent links to both metadata and file content are facilitated through [https://github.com/perma-id/w3id.org/tree/master/usgs w3id.org rewrite configurations] that are recorded with GeoKB entities.
 
= Source Processing =
The process of developing the GeoKB representation for document items is continuing to evolve. The intent is to leverage the GeoKB knowledge graph capability to organize the results of AI-assisted linked open data development, taking the results of graph fragments generated by Retrieval Augmented Generation (RAG) pipelines and recording those as part of the overall knowledge graph for query and analysis.
 
We are also developing a schema.org representation in JSON-LD as the base metadata store for the reports. These will be stored as files in the source repository in ScienceBase and will be recorded with the GeoKB items as YAML on item talk pages, replacing a previous experiment with the Zotero JSON structure for both metadata and attachments. In this way, any content from full metadata records not yet organized into the knowledge graph will be accessible via full text search in the Wikibase instance.


= Schema =
= Schema =
We treat the Zotero Group Library as our master source for these records currently, knowing that we may one day move to some other technology. We use code to pull the parts of metadata from Zotero we need for the knowledge graph representation with a few transformations to add functionality not available in Zotero. The following are specific notes on what the core statements/properties for these records mean and how they can be used:
We treat the ScienceBase collection as our master source for (meta)data currently. We are evolving our use of the schema.org specifications as a standardized way for encoding metadata details. Schema.org is natively linked open data in fundamentally knowledge graph form, linking identifiers between the document and other entities (authors, organizations, subject matters addressed). Some of these entities are more important than others in terms of the GeoKB representation. For instance, we "care" about authorship from the standpoint of bibliographic metadata for citations, but we do not need authors of these reports instantiated as entities in the graph. We do, however, care more about subject matters such as mining property names, locations, and mineral commodities produced from the standpoint of representation and linking in the knowledge graph.


== metadata URL (P141) ==
The following are specific notes on the mapping from schema.org source to GeoKB properties and pools of entities:
This is perhaps the most crucial identifier to use in any external reference or processing with these records. These are the "evergreen URLs" assigned for the items using the W3ID.org system. W3ID uses a simple rewrite mechanism, and we established a namespace rule for URLs that lets these URLs operate somewhat like a DOI in practical terms. The URL path includes a combination of the Zotero Group Library identifier and the metadata item identifier. If we move to different technology in future, we will retain these identifiers and continue to resolve them to whatever new access point we have in place.


Included with these claims are mime type qualifiers indicating that the URL can be used to send a user to what amounts to a landing page view within the Zotero user interface. The URLs will also respond to content negotiation with application/json as an accept header to return the raw JSON content of full metadata for the item.
== label, description, and aliases ==
Labels for NI 43-101 report items are a work in progress, with the intent to ultimately use the official title of the report extracted through document processing. The initial set of labels were generated using a few details from the file naming scheme employed when the files were being managed on local/network file stores. The descriptions will become increasingly important as additional context in the larger knowledge graph, viewed alongside labels that may or may not always indicate that an item is one of these types of reports. We will follow a standard convention that may move some of the basic descriptive details to descriptions. Aliases are currently unused but could contain other details such as a mine/project name that might be useful from a search/discovery standpoint.


== content URL (P136) ==
== instance of (P1) and classification ==
This statement contains a generated URL that is made up of the Group Library ID, the item key, and the attachment key. It links to browser-based PDF viewer that is only available for a user that has authenticated in a browser session (as it will not prompt for authentication). It is included as a convenience to let users move between working with the GeoKB interface in some way and viewing a document.
All report items are declared as an instance of this item (Q10). The following query will retrieve the full classification of these items:


== Zotero Attachment Key (P143) ==
<sparql tryit="1">
This is the unique key for the file content associated with a metadata item. In the case of this particular collection, there is only a single PDF file attachment for each report. Other collections may have more than one or different types of attachments.
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>


One use of the attachment key is with the pyzotero package where it can be used to dump a file to disc.
SELECT ?item ?itemLabel ?subclassOf ?subclassOfLabel
WHERE {
  wd:Q10 wdt:P2* ?item .
  ?item wdt:P2 ?subclassOf .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
</sparql>


<pre>
== metadata URL (P141) ==
zot.dump(
This is perhaps the most crucial identifier to use in any external reference or processing with these records. These are the "evergreen URLs" assigned for the items using the W3ID.org system. W3ID uses a simple rewrite mechanism, and we established a namespace rule for URLs that lets these URLs operate somewhat like a DOI in practical terms. The URL path includes a combination of the Zotero Group Library identifier and the metadata item identifier. If we move to different technology in future, we will retain these identifiers and continue to resolve them to whatever new access point we have in place.
    attachment_key,
    filename='some_file_name.pdf',
    path='./'
)
</pre>


The "zot" object here represents an authenticated and authorized Zotero API connection using pyzotero where an API key provided by the USGS or created by a user with read authorization is used to establish a connection to the library.
Included with these claims are mime type qualifiers indicating that the URL can be used to send a user to what amounts to a landing page view within the Zotero user interface. The URLs will also respond to content negotiation with application/json as an accept header to return the raw JSON content of full metadata for the item.
 
These claims include the file size in bytes of the attachment as a "data size" qualifier.


== Zotero Version Number (P142) ==
== data archived at (P200) ==
This is a specialized integer value indicating the version of the metadata used to create or update the entity in the GeoKB. It is used internally for synchronization purposes with the base store in Zotero.
This statement contains a w3id.org form of the ScienceBase identifier that serves as the logical repository for one or more PDF files associated with a given report, schema.org metadata (in JSON-LD) associated with the report, and any derivatives created through processing of the original content.


== GDDID (P93) ==
== GDDID (P93) ==
Line 47: Line 52:
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>


SELECT ?report ?reportLabel ?meta_url ?content_url ?attachment_key ?file_size
SELECT ?report ?reportLabel ?meta_url ?content_url ?attachment_key ?mime_type ?checksum ?gddid
WHERE {
WHERE {
   ?report wdt:P1 wd:Q10 ; # instance of NI 43-101 Technical Report
   ?report wdt:P1 wd:Q10 ; # instance of NI 43-101 Technical Report
Line 53: Line 58:
           wdt:P136 ?content_url ; # read URL to attachment content
           wdt:P136 ?content_url ; # read URL to attachment content
           wdt:P143 ?attachment_key ; # attachment key used to download PDF file content
           wdt:P143 ?attachment_key ; # attachment key used to download PDF file content
           p:P143 ?attachment_key_statement .
           p:P136 ?content_url_statement .
   ?attachment_key_statement pq:P144 ?file_size . # attachment file size
   ?content_url_statement pq:P65 ?mime_type ; # mime type of attachment
                        pq:P197 ?checksum . # MD5 checksum for the attachment
  OPTIONAL {
    ?report wdt:P93 ?gddid . # unique ID for the xDD cyberinfrastructure
  }
   SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
   SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
}

Latest revision as of 14:18, 6 May 2024

The National Instrument 43-101 Technical Reports are a special type of document required by the Canadian Securities Administrators for mining companies they regulate. A similar type of report called S-K 1300 is required by the U.S. Securities and Exchange Commission. Both sets of documents are being represented in the GeoKB as important scientific reference material, providing data and information on mining production history and geoscientific context (mineral deposit types, etc.). The GeoKB representation serves as a linking resource to data and information identified in or extracted from the reports through various types of processes.

Source Material

The digital source material for the reports is maintained in a restricted ScienceBase collection. Bibliographic metadata are organized into a Zotero group library that serves publishing scientists as a reference library resource and provides public "landing pages" that are linked to from citations. Permanent links to both metadata and file content are facilitated through w3id.org rewrite configurations that are recorded with GeoKB entities.

Source Processing

The process of developing the GeoKB representation for document items is continuing to evolve. The intent is to leverage the GeoKB knowledge graph capability to organize the results of AI-assisted linked open data development, taking the results of graph fragments generated by Retrieval Augmented Generation (RAG) pipelines and recording those as part of the overall knowledge graph for query and analysis.

We are also developing a schema.org representation in JSON-LD as the base metadata store for the reports. These will be stored as files in the source repository in ScienceBase and will be recorded with the GeoKB items as YAML on item talk pages, replacing a previous experiment with the Zotero JSON structure for both metadata and attachments. In this way, any content from full metadata records not yet organized into the knowledge graph will be accessible via full text search in the Wikibase instance.
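As a rough illustration of what a schema.org record in JSON-LD might look like for one report, the sketch below builds a small document and serializes it. The function name `build_report_record`, the choice of the schema.org `Report` type, the `about` entries, and all values are assumptions made for illustration; they are not the structure the GeoKB actually uses, which is still under development.

```python
import json


def build_report_record(title, metadata_url, property_name, commodities):
    """Build a hypothetical schema.org JSON-LD record for a technical report.

    Every field choice here is an illustrative assumption, not the GeoKB schema.
    """
    about = [{"@type": "Place", "name": property_name}]
    about += [{"@type": "Thing", "name": commodity} for commodity in commodities]
    return {
        "@context": "https://schema.org",
        "@type": "Report",
        "name": title,
        "url": metadata_url,
        "about": about,
    }


# Placeholder values only; none of these refer to a real report.
record = build_report_record(
    title="Example Mine Technical Report",
    metadata_url="https://example.org/report/1",
    property_name="Example Mine",
    commodities=["gold", "copper"],
)
print(json.dumps(record, indent=2))
```

Keeping the subject matter under `about` as typed nodes is one way to make mining property names and commodities linkable later, while authorship can stay as plain bibliographic metadata.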

Schema

We treat the ScienceBase collection as our master source for (meta)data currently. We are evolving our use of the schema.org specifications as a standardized way for encoding metadata details. Schema.org is natively linked open data in fundamentally knowledge graph form, linking identifiers between the document and other entities (authors, organizations, subject matters addressed). Some of these entities are more important than others in terms of the GeoKB representation. For instance, we "care" about authorship from the standpoint of bibliographic metadata for citations, but we do not need authors of these reports instantiated as entities in the graph. We do, however, care more about subject matters such as mining property names, locations, and mineral commodities produced from the standpoint of representation and linking in the knowledge graph.

The following are specific notes on the mapping from schema.org source to GeoKB properties and pools of entities:

label, description, and aliases

Labels for NI 43-101 report items are a work in progress, with the intent to ultimately use the official title of the report extracted through document processing. The initial set of labels was generated using a few details from the file naming scheme employed when the files were being managed on local/network file stores. The descriptions will become increasingly important as additional context in the larger knowledge graph, viewed alongside labels that may not always indicate that an item is one of these types of reports. We will follow a standard convention that may move some of the basic descriptive details to descriptions. Aliases are currently unused but could contain other details, such as a mine/project name, that might be useful from a search/discovery standpoint.

instance of (P1) and classification

All report items are declared as an instance of this item (Q10). The following query will retrieve the full classification of these items:

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?subclassOf ?subclassOfLabel
WHERE {
  wd:Q10 wdt:P2* ?item .
  ?item wdt:P2 ?subclassOf .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}



metadata URL (P141)

This is perhaps the most crucial identifier to use in any external reference or processing with these records. These are the "evergreen URLs" assigned to the items using the W3ID.org system. W3ID uses a simple rewrite mechanism, and we established a namespace rule that lets these URLs operate somewhat like a DOI in practical terms. The URL path includes a combination of the Zotero Group Library identifier and the metadata item identifier. If we move to a different technology in the future, we will retain these identifiers and continue to resolve them to whatever new access point we have in place.

Included with these claims are MIME type qualifiers indicating that the URL can be used to send a user to what amounts to a landing page view within the Zotero user interface. The URLs also respond to content negotiation: sending application/json in the Accept header returns the raw JSON content of the full metadata for the item.
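The content negotiation described above can be sketched with the Python standard library. The helper names `build_metadata_request` and `fetch_metadata` are hypothetical, and the URL passed in is whatever metadata URL (P141) value was retrieved from the GeoKB; no particular URL pattern is assumed here.

```python
import json
import urllib.request


def build_metadata_request(meta_url):
    """Prepare a GET request that asks for JSON instead of the landing page."""
    return urllib.request.Request(meta_url, headers={"Accept": "application/json"})


def fetch_metadata(meta_url):
    """Send the request and parse the JSON body.

    urlopen follows any redirects returned along the way. This function needs
    network access and is not called in this sketch.
    """
    with urllib.request.urlopen(build_metadata_request(meta_url)) as response:
        return json.load(response)


# Placeholder URL for illustration; substitute a real P141 value.
request = build_metadata_request("https://example.org/placeholder-metadata-url")
print(request.get_method(), request.full_url, request.get_header("Accept"))
```

Separating request construction from the network call makes it easy to inspect or log the headers before anything is sent.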

data archived at (P200)

This statement contains a w3id.org form of the identifier for the ScienceBase item that serves as the logical repository for one or more PDF files associated with a given report, the schema.org metadata (in JSON-LD) associated with the report, and any derivatives created through processing of the original content.

GDDID (P93)

This is the unique identifier for the document (article) as processed through and represented in the xDD (GeoDeepDive) cyberinfrastructure. That processing includes indexing of the full content, identification of key facets (concepts) in the documents, document segmentation through the COSMOS engine, and value-added processing pipelines. The identifiers can be used to interface with xDD APIs and other aspects of the infrastructure as needed. Not all NI 43-101 items will have these identifiers available.

Example Query

The following query can be modified as needed to pull records for external processing or other uses. If you do not need the title, remove the ?reportLabel variable and the SERVICE statement, which will improve query performance. Removing the LIMIT and pulling all records should not be a problem, but you can also page through the results using LIMIT and OFFSET (standard SPARQL solution modifiers).

PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX p: <https://geokb.wikibase.cloud/prop/>
PREFIX pq: <https://geokb.wikibase.cloud/prop/qualifier/>

SELECT ?report ?reportLabel ?meta_url ?content_url ?attachment_key ?mime_type ?checksum ?gddid
WHERE {
  ?report wdt:P1 wd:Q10 ; # instance of NI 43-101 Technical Report
          wdt:P141 ?meta_url ; # permanent URL to the online representation
          wdt:P136 ?content_url ; # read URL to attachment content
          wdt:P143 ?attachment_key ; # attachment key used to download PDF file content
          p:P136 ?content_url_statement .
  ?content_url_statement pq:P65 ?mime_type ; # mime type of attachment
                         pq:P197 ?checksum . # MD5 checksum for the attachment
  OPTIONAL {
    ?report wdt:P93 ?gddid . # unique ID for the xDD cyberinfrastructure
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100
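
Paging with LIMIT and OFFSET, as mentioned above, might look like the following minimal sketch. The endpoint URL is an assumption based on the usual wikibase.cloud layout and should be verified before use; the function names are hypothetical, and the query is a trimmed version of the one above. An ORDER BY clause is added because SPARQL does not guarantee a stable row order across pages without one.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint, following the common wikibase.cloud layout; verify before use.
SPARQL_ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"

BASE_QUERY = """
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?report ?meta_url
WHERE {
  ?report wdt:P1 wd:Q10 ;
          wdt:P141 ?meta_url .
}
ORDER BY ?report
"""


def paged_query(base_query, limit, offset):
    """Append LIMIT and OFFSET modifiers to a query that has neither."""
    return f"{base_query.rstrip()}\nLIMIT {limit}\nOFFSET {offset}"


def fetch_bindings(query):
    """Run one query against the assumed endpoint (requires network access)."""
    url = SPARQL_ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def iterate_reports(fetch=fetch_bindings, page_size=100):
    """Yield result rows page by page until a short page signals the end."""
    offset = 0
    while True:
        page = fetch(paged_query(BASE_QUERY, page_size, offset))
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```

The `fetch` argument is injectable so the paging logic can be exercised without touching the network, for example with a function that slices a local list.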



Note on file naming/identifiers

If an operator wants to take these documents into another environment for processing through some form of segmentation, vectorization, or other AI methods, it is important to retain an identifier that lets us connect extracted or derived information and data back to our core records. The metadata URL serves that purpose; the attachment key is an optional addition that can be important for tracking the specific PDF file used. One approach is to use all three keys (library ID, item key, and attachment key) in file naming or in whatever file system catalog is used:

E.g., <library ID>_<item key>_<attachment key>.pdf

Alternatively, treat the metadata URL like a DOI identifier on other types of publications, and incorporate that into processing such that any derivatives built from these records retain the connection to source.
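
The file naming pattern above can be sketched as a pair of helpers that build a file name from the three keys and recover them again later. The names `ReportFileId`, `to_filename`, and `from_filename` are hypothetical, the item and attachment keys shown are made-up placeholders, and the parser assumes that none of the three components contains an underscore.

```python
import re
from typing import NamedTuple


class ReportFileId(NamedTuple):
    library_id: str
    item_key: str
    attachment_key: str


# Assumption: no component contains an underscore, so "_" is a safe separator.
_FILENAME_PATTERN = re.compile(
    r"^(?P<library_id>[^_]+)_(?P<item_key>[^_]+)_(?P<attachment_key>[^_]+)\.pdf$"
)


def to_filename(ids):
    """Build <library ID>_<item key>_<attachment key>.pdf from the three keys."""
    return f"{ids.library_id}_{ids.item_key}_{ids.attachment_key}.pdf"


def from_filename(filename):
    """Recover the three keys from a file name produced by to_filename."""
    match = _FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename!r}")
    return ReportFileId(**match.groupdict())


# The library ID comes from the Zotero group library URL; the keys are placeholders.
example = ReportFileId("4530692", "ABCD1234", "WXYZ5678")
print(to_filename(example))
```

Being able to parse the name back means any derivative produced downstream can be traced to its source record from the file name alone.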