GeoKB Mindat presentation

The following is a set of web references and notes for a presentation on the GeoKB and how it relates to Mindat.

Why we're doing this in USGS and what the GeoKB is

  • Strategy Infographic
    • Briefly hit on why we're doing this and the strategy we are taking. Focus today is on the knowledge-banking piece.
  • GeoKB
    • We're building this capability in Wikibase as part of the wikibase.cloud beta environment being developed by Wikimedia Deutschland. We're doing that partly to take advantage of an instance of tech that we don't have to build out and operate ourselves, but mostly because of the underlying ethos. The idea with wikibase.cloud is to have many domain-specific knowledge bases sitting adjacent to Wikidata, which we can consider the Global Knowledge Commons. This is a powerful concept to latch onto. We've learned from the "Open Access Movement" in government that it's not enough to just put our data and information out there without also contributing our knowledge, encoded in ways that humans and AIs can use.
  • Invitation
    • One of the other interesting things about using this technology is that it is all based on MediaWiki, the same platform that has been powering Wikipedia since January 15, 2001. As such, there are all kinds of things we can do here beyond building triples. In fact, the triple store in Wikibase is an add-on. So, we're also experimenting with "using the tool to build the tool."

What we've done in the past and are doing now that the GeoKB builds on

MRData

  • MRData
    • We have done this work to some extent in pockets around our organization, including in the mineral resources realm. MRData was the result of one guy's (Peter Schweitzer) passionate pursuit of an interconnected suite of all the major data and information we have and produce.
  • MRData Science Topic Catalog
    • It's a web interface with some degree of machine access to thousands of databases and discrete products connected through earth system concepts, some of which have reasonable semantic depth. In addition to topics coming from the USGS Thesaurus, there are also other reference lists (more code lists than anything) that provide a little more depth and connectivity through the system. It's not a graph per se, but it has graph-like characteristics.
  • MRDS
    • Some of the connected databases that are part of the MRData suite are built out from a long legacy. The Mineral Resources Data System goes back to former US Bureau of Mines sources. It was meant to be a living data system of the best available information on mineral deposits, prospecting history, and related data (nationally and globally), which is one of the most important inputs to our mineral resource assessments.
  • Blackbird Mine, Idaho Cobalt, Idaho Cobalt Operations
    • MRDS provides a lot of the information in this area we need moving forward. However, it is not designed for multimodal input. We need a knowledge bank that multiple people and processes can contribute to through time.
    • Show the progression through the only cobalt mining in the U.S.
    • It's also not set up to accommodate other dimensions on the same entities related to other scientific domains that USGS is engaged in. In this case, there are other things about that same mine that USGS as a whole needs to know, such as its connections to mine waste and pollutants entering the watersheds it sits within. Those watersheds flow into the Salmon River in Idaho, affecting salmon reintroduction efforts by both the Federal government and the Shoshone-Bannock Tribes. We need a system that can accommodate ever-increasing detail across a limitless set of conceptual dimensions.

ScienceBase

  • ScienceBase example dataset
    • We also built ScienceBase. Its original design was inspired by a discussion I had with the founders of Freebase way back in the day, where I saw a need to provide a platform where science teams could organize everything needed in a given research pursuit into one interlinked system.
    • However, ScienceBase has ended up conforming more to how we are still doing business - creating and working with a bunch of only implicitly connected resources that are packaged into discrete databases, large and small. Here's an example of one of those that's a perfectly good effort in itself - digitizing all the mine-related features from historic topo maps. This is great, but the entities in something like this need to connect in some way to other information about the same entities and everything related to them, including the mineral commodities mined from the "geographic features."
  • ScienceBase Facets
    • That said, ScienceBase has 17.6M things in it that showed up there because it's filling a need. It's a supported repository for all kinds of scientific content, including digital representations or documentation of physical assets. Some of it is connected to other things, inside and outside ScienceBase, but they are still mostly individual collections and items that are basically disjoint and not interfaced with as a whole.

Geoconnex and IoW

  • Geoconnex
    • Another effort that is getting us closer to what we need is the Geoconnex work under the Internet of Water project, in which USGS Water is a partner. While this is focused pretty tightly on one particular issue (geographic features on surface water connected to the hydrographic network), it starts things rolling toward a whole lot of other things that need to be connected to that graph.
  • Geoconnex GraphDB
    • It uses the right underlying model (RDF) that will allow the system to extend into any area that it needs to in order to address other use cases. (Click on a stream gage item to see current properties.)

Knowledge representation approach in the GeoKB

  • Entities/Examples > mines
    • So, coming back to the GeoKB and our vision for this slice of USGS, we're iterating through all the different entities we need to have in our knowledge base in order to address our problems. Right now, we are focusing some time on prospecting history as one of the major inputs to the mineral resource assessment process. What mining operations are going on now and back through time for a given set of minerals? Where are the mines? What's their geoscientific context in terms of mineral deposit types and other aspects of their geologic setting? What's been extracted from the mining sites in ore, and what do we know about the grade or mineral content of those ores?
  • Mine item discussion and example query
    • To get at these questions, we need to do some of what Geoconnex deals with in terms of identifying geographic features and nailing down where mines are located. We need to do things like decide which of a set of multiple competing claims about geographic location is most suitable for our purposes. As we start to get this information pulled together dynamically from the many different sources, we can start to do things like this with SPARQL - look for mines with more than one coordinate location claim.
  • Example from query (coal mine from Oliveros' work)
    • This is one of the most important features of Wikibase (or really RDF in general). Having a wiki-style interface is actually pretty important at this phase of our work. People need to see this idea visually as well as in data form. These are not facts. They are assertions about some property or characteristic of an entity. We have to examine the details and the evidence behind the assertion to decide whether we can use it for our particular purpose. Another user or another driving purpose may mean that a different assertion is used. That's fine; we want both use cases to be satisfied simultaneously. We're not going for the single point of truth because we don't know what that is. We can look at the references and qualifiers and other details to decide what's most appropriate or even decide that all claims about something are reasonable and run a calculation to blend them into something dynamic (e.g., center of a group of point locations).
  • History from example
    • Another key aspect of this is provenance. We get something about where things come from in the references, but we also need to know how they got into the knowledge base. Who introduced the claims? Or what process introduced the claims? If it's a software process, do we trust that it operated correctly? We're still working through some of the conventions on how to do this best, but the built-in history tracking and the ability to get back to what things looked like before a change was introduced are incredibly compelling features of the MediaWiki-based tech platform (either Wikibase or Semantic MediaWiki).
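The multiple-claims idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GeoKB's actual code: the base URI, the property ID `P0`, the item IDs, and the coordinates are all placeholders, and the prefix layout follows the usual Wikibase RDF conventions. The query counts coordinate statements per item; the two helper functions do the same grouping locally and blend the competing claims into a simple arithmetic center, which is a crude approximation that only makes sense for points that are close together.

```python
from collections import defaultdict
from statistics import fmean

# Assumptions for illustration only: base URI and property ID are placeholders.
GEOKB_BASE = "https://geokb.wikibase.cloud"
COORD_PROP = "P0"  # hypothetical "coordinate location" property ID


def build_multi_claim_query(prop_id: str, base: str = GEOKB_BASE) -> str:
    """Build a SPARQL query for items with more than one statement on prop_id."""
    return f"""
PREFIX p: <{base}/prop/>
SELECT ?item (COUNT(?statement) AS ?claims) WHERE {{
  ?item p:{prop_id} ?statement .
}}
GROUP BY ?item
HAVING (COUNT(?statement) > 1)
""".strip()


def items_with_multiple_claims(claims):
    """Group (item_id, lat, lon) tuples and keep items with 2+ coordinate claims."""
    grouped = defaultdict(list)
    for item_id, lat, lon in claims:
        grouped[item_id].append((lat, lon))
    return {item: pts for item, pts in grouped.items() if len(pts) > 1}


def blend_center(points):
    """Arithmetic mean of latitudes and longitudes (crude center of nearby points)."""
    lats, lons = zip(*points)
    return (fmean(lats), fmean(lons))


# Made-up claims: two competing locations for one item, a single claim for another.
example_claims = [
    ("Q1", 45.0, -114.0),
    ("Q1", 45.2, -114.4),
    ("Q2", 40.0, -100.0),
]
query = build_multi_claim_query(COORD_PROP)
multi = items_with_multiple_claims(example_claims)
center = blend_center(multi["Q1"])
```

Nothing here picks a "winning" claim. Both original assertions stay in place, and the blended point is computed on demand for the use case that wants it.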

How we are leveraging Mindat as a source and connecting point

  • Mindat
    • So, let's get down to Mindat and its connection. While USGS is an authoritative source for some geoscience things, we are not the global authority on a lot of things like mineral taxonomy and rock names and classification. The IMA is the authority for mineral species (with some scientific debate on that). Work that Steve and others have led through cooperative bodies like the IUGS Commission for the Management and Application of Geoscience Information (CGI) has built awesome formal ontologies for some of these major references.
    • Mindat provides the most comprehensive source of a bunch of what we need to reference in the area of named and characterized minerals, mineral commodities, and rock names and classification. It's a community resource being contributed to by experts in the field around the world. It's not static but in regular development as scientific understanding advances.
  • Entities/Examples > rock classes
    • Those are the three things we are pulling from Mindat - rocks, minerals, and commodities. We're only grabbing a bare minimum of the information we need just to establish named entities with enough of the linkable elements in the GeoKB.
  • Entities/Examples > multi-instance commodities
    • We're also experimenting with a design decision to organize these entities in a different way than Mindat has used in their data model. Rather than separate entities in the GeoKB for minerals that are also identifiable as commodities and even chemical elements, we are focusing on the labels that are the targets of our links from other entities, essentially classifying them as instances of more than one type of thing. These assertions point to where we get that assertion from Mindat (or other sources) and the associated specific identifier.
  • Cobalt
    • So, we can see in an example like the concept for cobalt that it is a mineral according to Mindat. It's also a commodity according to both Mindat and our MRDS system, each with their own identifiers. And it happens to be a single-element mineral, so the same named concept also refers to the chemical element, cobalt. I'm not sure this is yet the best knowledge organization strategy, but I'm experimenting with the lumping over splitting idea to see where it takes us.
    • Part of the reason I'm exploring this is the adjacency to the Global Knowledge Commons and the desire to have this content stubbable into Wikidata. There's only one entity in Wikidata that's not a town or a color or a magazine. How do we communicate to the general public that this thing we call cobalt has multiple, context-dependent meanings, but that it's kind of all the same thing in a crude geoscientific context?
  • GeoKB repo
    • So, how do these claims/statements get from Mindat to the GeoKB? We're working on a few different pathways:
      • Bot code written using wikibaseintegrator in a workflow
      • Bot code written to work through pre-processed data with a mapping configuration (uses Pywikibot)
      • OpenRefine for cases where someone wants to do the refinement and alignment in table-ville
  • Mindat notebook
    • Walk through the basics of what we are doing.
  • cobalt
    • To wrap up, we are doing all of this so that we end up with a knowledge organization system that is permanent but mutable and constantly improving, where we capture the interlinkages between all of the important entities and concepts we need to use in our work. When we go to something like cobalt, we want that to be the best representation of that concept we can get, and we want to know that it links to and pulls useful statements about itself from another source we can consult when we want to go after more information. We want to easily get to all the things in our knowledge base that link to that concept, and we want the other environments we're working with like Mindat to be able to confidently connect through to anything we are adding to the mix that they might be interested in. We also want other people to tap this whole graph and decide what concepts and assertions are reasonable to pull out into other contexts, including the global knowledge commons.
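The "lumping over splitting" design and the mapping-configuration pathway described above could look roughly like the following Python sketch. It is a toy model, not the GeoKB bot code: the `MAPPING` table, the record fields, the class labels, and the source identifiers are all hypothetical placeholders, and no Wikibase library is called. It shows pre-processed source records being folded into one entity per label, with each "instance of" claim carrying its source and that source's identifier.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    prop: str       # property label, e.g. "instance of"
    value: str      # class label the entity is asserted to be an instance of
    source: str     # where the assertion comes from
    source_id: str  # identifier for the concept in that source


@dataclass
class Entity:
    label: str
    claims: list = field(default_factory=list)


# Hypothetical mapping configuration: (source, record kind) -> class label.
MAPPING = {
    ("Mindat", "mineral"): "mineral",
    ("Mindat", "commodity"): "mineral commodity",
    ("MRDS", "commodity"): "mineral commodity",
}


def lump(records, mapping=MAPPING):
    """Merge records that share a name into one entity with several claims."""
    entities = {}
    for rec in records:
        cls = mapping.get((rec["source"], rec["kind"]))
        if cls is None:
            continue  # no mapping configured for this kind of record
        key = rec["name"].casefold()
        entity = entities.setdefault(key, Entity(label=rec["name"]))
        claim = Claim("instance of", cls, rec["source"], rec["id"])
        if claim not in entity.claims:  # avoid duplicate assertions
            entity.claims.append(claim)
    return entities


# Placeholder records; the "id" values are not real Mindat or MRDS identifiers.
records = [
    {"source": "Mindat", "kind": "mineral", "name": "Cobalt", "id": "mindat-mineral-id"},
    {"source": "Mindat", "kind": "commodity", "name": "cobalt", "id": "mindat-commodity-id"},
    {"source": "MRDS", "kind": "commodity", "name": "Cobalt", "id": "mrds-commodity-id"},
]
entities = lump(records)
cobalt = entities["cobalt"]
```

In this sketch, cobalt ends up as a single entity with three claims: one from Mindat as a mineral, and one each from Mindat and MRDS as a commodity. A "splitting" design would instead create a separate entity for each (source, kind) pair.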