About


Welcome to the Geoscience Knowledgebase (GeoKB)

Project Intent

The Geoscience Knowledgebase (GeoKB) is an experimental R&D effort in the U.S. Geological Survey, with this particular instance on Wikibase Cloud serving as our current skunkworks. We're ultimately trying to develop a new way of organizing and encoding all applicable knowledge our institution develops about the earth system so that it connects to the much broader global knowledgebase. As a government science institution, it's not really our role to muck about in the public commons knowledge-bank (e.g., Wikidata), but at the same time, it is our mandate to fully release (donate) what we know into the public domain for the public good.

What we're experimenting with here is how we can organize and project our data, information, and knowledge in a way that is more connected and more accessible for others to pick up and run with. By putting it all into the very granular structure that Wikibase is built on and leveraging pertinent aspects of the semantics in use in Wikidata and formal ontologies, our hope is that the curatorial pathway from "our knowledge" to "everyone's shared knowledge" is as frictionless as possible.

We are also working on how this knowledgebase resource can be baked into our ongoing geoscientific research as a digital scientific instrument - building it by using it in practice. Rather than treating it as an afterthought, contributed to only by someone with enough interest at the end of a project, we are seeking to build it as a scientific instrument used directly in research and analysis. Working with this capability to solve the important information and knowledge management problems that are impeding our research practices today should result in a much more efficient pathway to usable knowledge projected out for others to take advantage of for their own purposes.

Some of what we bring together in the GeoKB will originate from our work in USGS, while much of it will come from the many other public data and information sources we consult in our research. We'll be working hard to get things right in terms of references and qualifiers on claims and careful provenance tracing through item and property history and annotation. We'll share the code we use in developing bots to handle as much of this work as possible, starting with this project.

Wherever possible, we will pull directly from and build associations with Wikidata properties and classification items, though we are making judgment calls on where we agree/disagree with the specific semantics. While we may pull whole groups of items from Wikidata through bots, we are being selective in what claims we leverage from Wikidata, focusing on the parts that matter to our work and that we trust sufficiently to use. We'll dig a bit into what other groups are doing in this regard to follow useful conventions so we make our stuff as linkable as possible.

Claims, not facts

The Wikidata/Wikibase approach is particularly interesting in that it is built on a fundamentally simple model in which an entity is nothing more than a label with alternates, a description, and a local identifier. Entities become something by having statements made about them; within the architecture, these statements are also called claims. This is what makes it a knowledge system. The claims are not facts; they are assertions made by their contributors. A user can choose to trust or not trust a particular claim based on whatever is recorded directly with the claim. Important factors in making a judgment about what to trust include:

  • how the property used to make the statement is defined through references and qualifiers (do you trust the source?)
  • who made the assertion and how it was made - personal entry, actions of a bot backed by open and accessible software code, etc.

Entities can also have multiple claims made about the same thing - multiple values for the same property. This can be a way of recording competing claims about the same phenomenon, which is a common reality in data integration work. Different sources of information about the same entity, such as a mining project, can provide differing values for the same properties. Together they may provide a more thorough picture of the whole, and/or one source may prove to be more trustworthy than another.
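
To make this concrete, here is a minimal sketch in Python of what competing claims on a single property can look like. The item and property identifiers (Q/P numbers), the qualifier, and the reference sources are all hypothetical placeholders, and the dictionary only loosely approximates the much richer statement JSON Wikibase actually stores; the point is simply that each value carries its own qualifiers and references that a user can weigh.

# Hypothetical, simplified sketch of two competing claims on the same property
# for one entity. Q/P identifiers are placeholders, and the structure only
# loosely approximates Wikibase's actual statement JSON.
mining_project = {
    "id": "Q4567",  # placeholder item for a mining project
    "label": "Example Mining Project",
    "claims": {
        "P9": [  # placeholder property: "coordinate location"
            {
                "value": {"latitude": 44.01, "longitude": -112.55},
                "qualifiers": {"applies to part": "permitted project area"},
                "references": [{"stated in": "state permitting database"}],
            },
            {
                "value": {"latitude": 44.03, "longitude": -112.60},
                "qualifiers": {"applies to part": "mill site"},
                "references": [{"stated in": "company technical report"}],
            },
        ]
    },
}

# A user (or bot) can inspect both values and their provenance before deciding
# which claim, if either, is fit for their purpose.
for claim in mining_project["claims"]["P9"]:
    print(claim["value"], "|", claim["references"][0])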

The needs of the user or purpose of a query can also drive the selection of appropriate claims that are most fit for purpose. By surfacing as many claims as possible and assembling them into a common graph, we can give users the power to make judgment calls about what's most appropriate for their use. The SPARQL query that is executed encapsulates the logic, and users can explain their reasoning in whatever venue the query is shared (e.g., code notebook, research paper, etc.).
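
As an illustration of that pattern, here is a minimal sketch of a query, submitted through Python, that surfaces every value asserted for a property along with whatever reference was recorded on each statement. The endpoint URL, prefix URIs, and P-numbers follow common wikibase.cloud and Wikibase RDF conventions but are assumptions for illustration, not the GeoKB's actual identifiers.

# Minimal sketch: surface all claims for a (placeholder) property P9, along
# with their recorded reference sources, so the query itself documents the
# selection logic. Endpoint URL, prefixes, and P-numbers are illustrative
# assumptions based on common Wikibase RDF conventions.
import requests

ENDPOINT = "https://geokb.wikibase.cloud/query/sparql"  # assumed endpoint location

QUERY = """
PREFIX p:    <https://geokb.wikibase.cloud/prop/>
PREFIX ps:   <https://geokb.wikibase.cloud/prop/statement/>
PREFIX pr:   <https://geokb.wikibase.cloud/prop/reference/>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?item ?value ?source WHERE {
  ?item p:P9 ?statement .
  ?statement ps:P9 ?value .
  OPTIONAL {
    ?statement prov:wasDerivedFrom ?ref .
    ?ref pr:P31 ?source .          # placeholder "stated in" reference property
  }
}
LIMIT 100
"""

response = requests.get(ENDPOINT, params={"query": QUERY, "format": "json"})
for row in response.json()["results"]["bindings"]:
    print(
        row["item"]["value"],
        row["value"]["value"],
        row.get("source", {}).get("value", "no reference recorded"),
    )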

Too often, these judgments are made in the background without a sufficient record describing the reasoning. We have thousands of examples of individual scientific datasets developed this way, without information granular enough for the next scientist to base their own reasoned decisions on. We're interested in how the knowledgebase approach helps us correct that dynamic in our work.

Knowledgebase Development

We are taking advantage of building the GeoKB within the Wikibase/Wikimedia tech platform to use the other features available here for development. Notable are the discussion pages on items and properties, where we can capture development work centered on those concepts. This is all very much a work in progress. While we have some formal ontologies and models to draw from, much of the content we are working to organize here comes from other types of discrete, not particularly well-connected datasets or from less structured forms.

A central point of discussion, and a directory off to the other parts of the framework where we are developing concepts and structure, is found on the SPARQL examples page. Visit that page for a description of the major entity types, example queries associated with those entities, and links to deeper-level details on the technical development.

Graph Federation PLUS Graph Interaction

After experimenting with this platform for about a year, we're starting to get some clarity on how to frame out more of an operational infrastructure for the Geoscience Knowledgebase idea. We ended up rebuilding quite a number of entities here that don't really need to be in this Wikibase instance other than to let us link to them effectively from other entities. This includes reference material such as geographic places, "minerals," commodities, etc. Some of these were/are not in the best shape anyway in terms of their own alignment with linked open data concepts (e.g., things in source datasets that should explicitly link to something with semantic depth and definition do not). So, doing some work on those to get them into proper RDF+OWL alignment is not wasted effort. However, it would be better overall to do some work at the source, or in between, to provide more well-formed OWL-structured data and then federate with those graphs on persistent, resolvable identifiers.

Where we really need the Wikibase functionality is to support human interaction with evolving graphs and the knowledgebase as a whole. One major area we've been exploring here is claim uncertainty reduction. We might generate a whole set of claims based on an AI-assisted process, but then we need a place where sometimes competing claims about the same thing can be evaluated by subject matter experts who can record something about their expert judgments. Some of this might be surfaced by watching how queries are formed (e.g., which claims are trusted in practical use). But we also need a place where users can interact with the knowledgebase as a whole and introduce new information. This is part of why we started down the Wikibase path - all the different tooling already developed, such as OpenRefine, and a mature API.
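
As a small illustration of the kind of interaction that API supports, the sketch below pulls all of the claims on a single entity through the standard wbgetclaims action and flags properties with more than one value as candidates for expert review. The base URL and entity identifier are placeholders for illustration.

# Minimal sketch: use the Wikibase action API (wbgetclaims) to pull every claim
# on an entity so competing values can be queued for subject matter expert
# review. The API URL and the entity ID are illustrative placeholders.
import requests

API = "https://geokb.wikibase.cloud/w/api.php"  # assumed API location

resp = requests.get(API, params={
    "action": "wbgetclaims",
    "entity": "Q4567",  # placeholder item to review
    "format": "json",
})

for prop, statements in resp.json().get("claims", {}).items():
    if len(statements) > 1:  # more than one value: a candidate for expert review
        print(f"{prop} has {len(statements)} competing statements")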

Two big architectural issues currently need to be addressed for both effective graph federation and user interaction to flourish.

Effective Foreign Ontology/Graph Linking

What I've experimented with so far to federate content from the GeoKB Wikibase with other graphs is the Qlever SPARQL platform. I think we can get a ton of great functionality at the SPARQL interface level with this capability. I'm also exploring Qlever's full-text indexing functionality, which would be an enhancement over what we have been doing with loading larger unstructured text and/or full original content into "item talk" pages. Doing this completely within Wikibase means we would need to build some kind of abstraction that integrates Wikimedia API search with SPARQL.
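
The sketch below shows the general shape of the federated query we have in mind: local GeoKB statements joined, through a SPARQL 1.1 SERVICE clause, to labels maintained in an external graph. The endpoint URLs, prefixes, and property number are illustrative placeholders, and whether the clause actually executes depends on the federation support configured in the SPARQL engine being used.

# Minimal sketch of a federated query joining local GeoKB statements to an
# external graph through a SPARQL 1.1 SERVICE clause. URLs, prefixes, and the
# property number are illustrative placeholders; the query would be submitted
# to the federating endpoint in the same way as the query shown earlier.
FEDERATED_QUERY = """
PREFIX p:    <https://geokb.wikibase.cloud/prop/>
PREFIX ps:   <https://geokb.wikibase.cloud/prop/statement/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?item ?externalConcept ?externalLabel WHERE {
  # Local statements that point at a concept defined in another graph,
  # using a placeholder property P42.
  ?item p:P42 ?st .
  ?st ps:P42 ?externalConcept .

  # Pull that concept's preferred label from its home graph.
  SERVICE <https://example.org/external/sparql> {
    ?externalConcept skos:prefLabel ?externalLabel .
  }
}
LIMIT 50
"""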

In Wikibase, we really want as many properties as possible to be of the "item" type. What this essentially means is that they link to something "real" on the other end rather than to a text value with no explicit semantic definition. We need some type of architectural shift that allows a Wikibase instance to be aware of, and federated with, many other graphs, including formal ontologies, along with a new property type whose value is an item/entity in one of those other graphs. We want those items to act just like a local Wikibase item, especially in terms of the UI that I'll discuss in the next section.

If we had the ability to federate with, rather than recreate locally, the various external graphs that a given knowledgebase needs to work with, we would need those foreign items to masquerade as if they were local to the given Wikibase. This includes things like type-ahead search support and perhaps even functionality that lets one of those foreign items have a linkable web page directly within a given Wikibase instance. This could get really sophisticated, with configurable settings that prevent those localized federated entities from being edited, or perhaps a way to allow users of a Wikibase instance to add to but not take away from a federated entity.

UI Enhancements and Plugin Support

The current Wikibase.cloud environment does not support the addition of the many plugins and components that have been developed in the open-source Wikimedia landscape. This is partly because the dated but still effective Wikimedia technology base comes with some serious risks of user malfeasance via JavaScript and other injection attacks. However, there are some things that we really should get in play.

At the top of the list is inline conformance checking, where deviation from property constraints is highlighted visually. As I understand it, there are basically two methods that Wikibase/Wikidata currently supports to highlight or force compliance with a specific schema - ShEx and property constraints (which require a Wikimedia plugin and configuration). The property constraints approach pre-dates the work with ShEx, and so Wikidata developed and incorporated its UI methods around that architecture. I'd like to take this a step further and build something at the data layer that indicates non-conformance with applicable schema definitions from ShEx. I'd like SPARQL results to come along with what would amount to a "buyer beware" statement - go ahead and use this information as you see fit, but be aware that there are these specific issues according to the schemas this knowledgebase conforms with.
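
The sketch below gives a feel for the kind of ShEx shape that data-layer check might evaluate. The prefixes, property numbers, class, and cardinalities are illustrative placeholders rather than a schema the GeoKB actually enforces; a ShEx processor run against the RDF would report which items deviate from the shape, and those reports are what we would want to travel alongside SPARQL results as the "buyer beware" annotation.

# Minimal sketch of a ShEx shape (held here as a Python string) that a schema
# check could evaluate against the graph. All identifiers and cardinalities are
# illustrative placeholders, not an actual GeoKB schema.
MINING_PROJECT_SHAPE = """
PREFIX geokbt: <https://geokb.wikibase.cloud/prop/direct/>
PREFIX geokb:  <https://geokb.wikibase.cloud/entity/>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

<#mining_project> {
  geokbt:P1 [ geokb:Q10 ] ;       # placeholder "instance of: mining project"
  geokbt:P9 LITERAL {1,5} ;       # one to five coordinate location values
  geokbt:P42 xsd:string *         # optional external identifiers
}
"""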

Lower down the list are the visual helpers for globe coordinates and Commons image links. There's cool functionality built into the Wikidata UI that would be great to see in wikibase.cloud instances. The little map preview for coordinate location properties is the most important feature we'd like to see, as it's often the case that we are recording multiple competing location claims for a given entity. It would be great to be able to see these visually, perhaps even enhancing the existing plugin to show multiple points on a map for a given set of claims on the same property. The Commons image preview is important but secondary, with a larger background issue being the ability to point to multiple image repositories beyond Wikimedia Commons.

Partnering

In USGS, we are committed to the goals and ideals of open science and are working to improve our practices in line with those principles. We don't yet have a formal route to collaborate on this particular project, but we're working to introduce it and bring it into the community for collaboration via the Earth Science Information Partners (ESIP). Look for a presentation on this work at the ESIP Summer Meeting in Burlington, VT, in July 2023.

A note about something called iSAID

An earlier effort in building a graph-based knowledgebase structure was something we called iSAID. This was a somewhat tongue-in-cheek name that stands for the Integrated Science Assessment Information Database. iSAID was motivated by the need to comprehensively look across the USGS at our entire scientific portfolio to characterize and understand our capacity to pursue new scientific objectives. The nodes in the iSAID graph include people, organizations, projects, publications, datasets, and models. iSAID is still a going concern, and we're now working on bringing the public aspect of iSAID (which is about 90% or more) into the Wikibase model here. We need all of those same entities represented in the knowledgebase for the other use cases we are pursuing.

Disclaimer

This is an experimental effort that will deal only with the organization and portrayal of already publicly available data, information, and knowledge. Everything we incorporate here that is directly attributable to the USGS comes from officially released, peer-reviewed material. Our pathway from institutional knowledge development processes to releasable data, software, and knowledge products is guided by the policies in our Fundamental Science Practices. How these policies apply to the very different form of an integrated knowledgebase is part of our experimentation and development work.