Main Page

Welcome to the Geoscience Knowledgebase (GeoKB)

Project Intent

The Geoscience Knowledgebase (GeoKB) is an experimental R&D effort in the U.S. Geological Survey, with this particular instance on Wikibase Cloud serving as our current skunkworks. We're ultimately trying to develop a new way of organizing and encoding all of the knowledge our institution develops about the earth system in a way that connects to the much broader global knowledgebase. As a government science institution, it's not really our role to muck about in the public commons knowledge-bank (e.g., Wikidata), but at the same time, it is our mandate to fully release (donate) what we know into the public domain for the public good.

What we're experimenting with here is how we can organize and project our data, information, and knowledge in a way that is more connected and more accessible for others to pick up and run with. By putting it all into the very granular structure that Wikibase is built on and leveraging pertinent aspects of the semantics in use in Wikidata and formal ontologies, our hope is that the curatorial pathway from "our knowledge" to "everyone's shared knowledge" is as frictionless as possible.

We are also working on how this knowledgebase resource can be baked into our ongoing geoscientific research as a living tool - building it by using it in practice. Rather than an afterthought, contributed to only at the end of a project by someone with enough remaining interest, we are seeking to build it as a scientific instrument used directly in research and analysis. Working with this capability to solve the important information and knowledge management problems that impede our research practices today should result in a much more efficient pathway to usable knowledge projected out for others to take advantage of for their own purposes.

Some of what we bring together in the GeoKB will originate from our work in the USGS, while much of it will come from the many other public data and information sources we consult in our research. We'll work hard to get things right in terms of references and qualifiers on claims and careful provenance tracing through item and property history and annotation. We'll share the code we use in developing bots to handle as much of this work as possible, starting with this project.
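
As a taste of that bot pattern, here is a minimal sketch of writing an item with a referenced claim using the open-source wikibaseintegrator Python library. The host URL, login credentials, and all property and item IDs (P1, P99, Q10) are illustrative placeholders, not actual GeoKB identifiers:

  # Minimal sketch: a bot creating an item with a referenced claim.
  # All IDs (P1, P99, Q10) and the host URL are illustrative placeholders.
  from wikibaseintegrator import WikibaseIntegrator, wbi_login
  from wikibaseintegrator.wbi_config import config as wbi_config
  from wikibaseintegrator.datatypes import Item, URL
  from wikibaseintegrator.models import Reference, References

  wbi_config['MEDIAWIKI_API_URL'] = 'https://geokb.wikibase.cloud/w/api.php'
  wbi_config['WIKIBASE_URL'] = 'https://geokb.wikibase.cloud'
  wbi_config['USER_AGENT'] = 'geokb-example-bot/0.1'

  login = wbi_login.Login(user='BotUser', password='BotPassword')
  wbi = WikibaseIntegrator(login=login)

  item = wbi.item.new()
  item.labels.set('en', 'Example mining project')
  item.descriptions.set('en', 'Placeholder item created by an example bot')

  # The reference records where the claim came from (P99 = a "reference URL"
  # property in this sketch), so users can judge whether to trust the claim.
  ref = Reference()
  ref.add(URL(value='https://example.org/source-record', prop_nr='P99'))
  refs = References()
  refs.add(ref)

  # P1 = "instance of" and Q10 = "mining project" in this sketch.
  item.claims.add(Item(value='Q10', prop_nr='P1', references=refs))
  item.write(summary='example bot edit with a referenced claim')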

Wherever possible, we will pull directly from and build associations with Wikidata properties and classification items, though we are making judgment calls on where we agree/disagree with the specific semantics. While we may pull whole groups of items from Wikidata through bots, we are being selective in what claims we leverage from Wikidata, focusing on the parts that matter to our work and that we trust sufficiently to use. We'll dig a bit into what other groups are doing in this regard to follow useful conventions so we make our stuff as linkable as possible.

Claims, not facts

The Wikidata/Wikibase approach is particularly interesting in that it is built on a fundamentally simple model: an entity is nothing more than a label with alternates, a description, and a local identifier. Entities become something by having statements made about them, which are also called claims within the architecture. This is what makes it a knowledge system. The claims are not facts; they are assertions made by their contributors. A user can choose to trust or not trust a particular claim based on whatever is recorded directly with the claim (a sketch of how this looks in an entity's raw JSON follows the list below). Important factors in making a judgment about what to trust include:

  • how the statement is supported through references and qualifiers (do you trust the source?)
  • who made the assertion and how it was made - personal entry, actions of a bot backed by open and accessible software code, etc.
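
To make that structure concrete, here is a minimal sketch of pulling an entity's raw JSON and inspecting its claims along with their references and qualifiers. The item ID Q123 and the host URL are illustrative placeholders; Special:EntityData is the standard Wikibase route to an entity's JSON:

  # Minimal sketch: inspecting an entity's claims, references, and qualifiers.
  # Q123 and the host URL are illustrative placeholders.
  import requests

  url = 'https://geokb.wikibase.cloud/wiki/Special:EntityData/Q123.json'
  entity = requests.get(url, timeout=30).json()['entities']['Q123']

  print(entity['labels']['en']['value'])          # the label
  print(entity.get('aliases', {}).get('en', []))  # alternate labels
  print(entity['descriptions']['en']['value'])    # the description

  # Statements ("claims") are grouped by property; each statement carries its
  # own references and qualifiers that a user can weigh before trusting it.
  for prop, statements in entity.get('claims', {}).items():
      for st in statements:
          print(prop,
                st['mainsnak'].get('datavalue', {}).get('value'),
                'references:', len(st.get('references', [])),
                'qualifiers:', sorted(st.get('qualifiers', {})))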

Entities can also have multiple claims made about the same thing - multiple values for the same property. This can be a way of recording competing claims about the same phenomenon, which is a common reality in data integration work. Different sources describing the same entities, such as mining projects, can carry differing information about them. Together they may provide a more thorough picture of the whole, and/or one source can prove to be more trustworthy than another.

The needs of the user or purpose of a query can also drive the selection of appropriate claims that are most fit for purpose. By surfacing as many claims as possible and assembling them into a common graph, we can give users the power to make judgment calls about what's most appropriate for their use. The SPARQL query that is executed encapsulates the logic, and users can explain their reasoning in whatever venue the query is shared (e.g., code notebook, research paper, etc.).
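
As an illustration, here is a minimal sketch of a query that surfaces competing values for the same property, along with the source referenced on each statement, leaving the fitness-for-purpose judgment to the user. The endpoint URL and the property IDs (P50 for the claimed value, P99 for the reference source) are placeholders, and the p:/ps:/pr: prefixes follow the standard Wikibase statement model:

  # Minimal sketch: surfacing competing claims and their referenced sources.
  # The endpoint URL and property IDs (P50, P99) are illustrative placeholders.
  from SPARQLWrapper import SPARQLWrapper, JSON

  sparql = SPARQLWrapper('https://geokb.wikibase.cloud/query/sparql',
                         agent='geokb-example-query/0.1')
  sparql.setQuery("""
  SELECT ?project ?value ?source WHERE {
    ?project p:P50 ?statement .        # the full statement node, not just a value
    ?statement ps:P50 ?value .         # the claimed value
    OPTIONAL { ?statement prov:wasDerivedFrom/pr:P99 ?source . }
  }
  """)
  sparql.setReturnFormat(JSON)

  for row in sparql.query().convert()['results']['bindings']:
      print(row['project']['value'],
            row['value']['value'],
            row.get('source', {}).get('value', 'no reference recorded'))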

Too often, these judgments are made in the background without a sufficient record of the reasoning. We have thousands of examples of individual scientific datasets developed this way, without information granular enough for the next scientist to make their own reasoned decisions. We're interested in how the knowledgebase approach helps us correct that dynamic in our work.

Knowledgebase Development

Building the GeoKB within the Wikibase/MediaWiki tech platform lets us take advantage of other features available here for development. Notable are the discussion pages on items and properties, where we can capture development work centered on those concepts. This is all very much a work in progress. While we have some formal ontologies and models to draw from, much of the content we are working to organize here comes from discrete, not particularly well connected datasets or less structured forms.

The SPARQL examples page serves as a central point of discussion and a directory to the other parts of the framework where we are developing concepts and structure. Visit that page for a description of the major entity types, example queries associated with those entities, and links to deeper-level details on the technical development.

Partnering

In the USGS, we are committed to the goals and ideals of open science and are working to improve our practices in line with those principles. We don't as yet have a formal route to collaborate on this particular project, but we're working to introduce it and bring it into the community for collaboration via the Earth Science Information Partners (ESIP). Look for a presentation on this work at the ESIP Summer Meeting in Burlington, VT in July 2023.

A note about something called iSAID

An earlier effort in building a graph-based knowledgebase structure was something we called iSAID. This was a somewhat tongue-in-cheek name that stands for the Integrated Science Assessment Information Database. iSAID was motivated by the need to comprehensively look across the USGS at our entire scientific portfolio to characterize and understand our capacity to pursue new scientific objectives. The nodes in the iSAID graph include people, organizations, projects, publications, datasets, and models. iSAID is still a going concern, and we're now working on bringing the public aspect of iSAID (which is about 90% or more) into the Wikibase model here. We need all of those same entities represented in the knowledgebase for the other use cases we are pursuing.

Disclaimer

This is an experimental effort that deals only with the organization and portrayal of already publicly available data, information, and knowledge. Everything we incorporate here that is directly attributable to the USGS comes from officially released, peer-reviewed material. Our pathway from institutional knowledge development processes to releasable data, software, and knowledge products is guided by the policies in our Fundamental Science Practices. How those policies apply to the very different form of an integrated knowledgebase is part of our experimentation and development work.