Property talk:P2: Difference between revisions

Latest revision as of 12:26, 31 May 2024

Subclass and Same As

Wherever possible, the GeoKB leverages classification schemes from established sources of definition and semantic organization within our scientific domains. When we do so, we use the same as property to record a resolvable URL/identifier for the concept in the foreign source. At the same time, we are working within the Wikibase/Wikidata environment, which is a bit "peculiar" in that it is expressly built to be navigated and edited directly by users. One consequence is that labels also matter, which has led us to make a number of design decisions about how this knowledge graph is organized.

Most domain-specific classification systems focus on defining things within their specific context, which is inevitably narrower than what we are doing here (let alone what happens in the global knowledge commons). The same label in one classification scheme (e.g., a domain ontology) may have a completely different meaning in another. Wikipedia deals with this through disambiguation pages that link off to specific "meanings" of the same word or concept. Wikidata ends up with a mix of approaches, with a basic constraint built into the Wikibase technology: language-specific labels may be duplicated, but two items may not share exactly the same label and description, which essentially makes the description the "disambiguator."

We are testing a little bit of a different approach here in that we attempt to bring about some level of synthesis as we process and curate various reference sources into our classification scheme. Where reasonable, we expand the definition of a specific concept so that it can be more than one thing. This may result in the same item having multiple same as linkages, some of which point to different items in the same source when we determine that, for our purposes, we can link to a single entity from multiple contexts and have it provide what's necessary in our graph. This doesn't always work out in practice, and so we come back and adjust as use cases point toward the need for more granular specificity.

While it's possible for an entity in the GeoKB to be both a subclass of some higher-level class and an instance of something, we generally try to keep the classification system as "clean" as possible. One example where entities act as both classifiers and instances is mineral commodity: the items are part of some other classification system (e.g., mineral classification from the Geoscience Ontology) but also act as instances of mineral commodities. We could eventually turn "commodity" into a classification system in its own right, but for now we have a way of querying for what we need on items that are reasonably defined in an explainable way.

In this approach, we also use same as expressly on classification items vs. a specific ExternalID property. For instance, we pulled in many of the topics from OpenAlex to use as classifiers for people, organizations, and creative works. We have an OpenAlex ID ExternalID property that is used for person entities because this essentially represents a direct and specific alternate identifier for a person within that system. While the same can be said for topics directly sourced from OpenAlex, we opted to use the same as property to create this linkage using the URL form of the OpenAlex identifier. Those same topical items will have same as links to Wikidata and some are the same concepts that we pulled in from earlier work with the USGS Thesaurus. We do the same with concepts tracking to Mindat, the Geoscience Ontology, and other sources. As needed we may start recording qualifiers of some sort on the same as relationships when there is some point of disagreement that should be noted - "same as but..."

Full Classification

It can be necessary and useful to pull the full classification from the GeoKB, including same as links, for use in processing source material for organization here. The following query is designed to support tasks such as building a simple lookup from the source-system identifiers we've recorded as same as claims to the corresponding classification items.

PREFIX geokbe: <https://geokb.wikibase.cloud/entity/>
PREFIX geokbp: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?subclass_of ?subclass_ofLabel ?same_as 
WHERE {
  ?item geokbp:P2* geokbe:Q2 ; # Get all subclasses (transitively) starting from "entity" as the origin
        geokbp:P2 ?subclass_of .
  OPTIONAL {
    ?item geokbp:P84 ?same_as . # Get any same as declarations for the classification entities
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

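As a minimal sketch of the lookup idea (not a description of the actual GeoKB processing code), the results of the query above could be folded into a mapping from each same as identifier to the GeoKB items that claim it. The sketch assumes the results arrive in the standard SPARQL JSON results layout; the function name, the Q-numbers, and the example.org URLs are hypothetical placeholders.

```python
from collections import defaultdict


def build_same_as_lookup(results):
    """Map each same-as identifier to the sorted GeoKB item URIs that claim it."""
    lookup = defaultdict(set)
    for row in results["results"]["bindings"]:
        same_as = row.get("same_as")  # absent when the OPTIONAL did not match
        if same_as is None:
            continue
        lookup[same_as["value"]].add(row["item"]["value"])
    return {key: sorted(items) for key, items in lookup.items()}


# Hypothetical sample rows; the identifiers below are placeholders, not real GeoKB data.
ITEM_A = "https://geokb.wikibase.cloud/entity/Q900001"
ITEM_B = "https://geokb.wikibase.cloud/entity/Q900002"
sample = {
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": ITEM_A},
             "same_as": {"type": "uri", "value": "https://example.org/concept/a"}},
            # The same item repeats when it has more than one subclass_of value.
            {"item": {"type": "uri", "value": ITEM_A},
             "same_as": {"type": "uri", "value": "https://example.org/concept/a"}},
            {"item": {"type": "uri", "value": ITEM_A},
             "same_as": {"type": "uri", "value": "https://example.org/concept/b"}},
            # No same_as binding for this item.
            {"item": {"type": "uri", "value": ITEM_B}},
        ]
    }
}

lookup = build_same_as_lookup(sample)
```

Collecting items in a set is a deliberate choice here: an item with several subclass of claims comes back as several rows, and the set keeps those repeats from inflating the mapping.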


Vector Similarity

In the GeoKB, the concepts built with subclass of claims extending to "origin" are the collection of major descriptors that we link to as subject matter associated with other actors in the graph. As part of our work to better align disconnected data, information, and knowledge systems as linked open data, we are running experiments on vector similarity analyses to turn "label-only objects" into identified objects in the graph. We do this with different open-source embedding models, placing both our linkable concepts (subclasses within the graph) and "mystery objects" into a vector store. There we can run different kinds of similarity analyses (nearest neighbor, cosine similarity, Euclidean distance, etc.) to determine where a relationship is strong enough to create a link, with qualifiers indicating the embedding model used and any other details.
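To make the similarity measures named above concrete, here is a small pure-Python sketch of cosine similarity, Euclidean distance, and a thresholded nearest-neighbor match. It is an illustration only: the three-dimensional vectors stand in for real embeddings, the item keys are placeholders, and the 0.8 cutoff is an arbitrary assumption rather than a value we have settled on.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(query_vec, targets, min_similarity):
    """Return (item_id, score) for the most similar target, or None below the cutoff."""
    scored = ((item_id, cosine_similarity(query_vec, vec))
              for item_id, vec in targets.items())
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= min_similarity else None


# Toy vectors standing in for embeddings of linkable concepts (placeholder keys).
targets = {"Q_A": [1.0, 0.0, 0.0], "Q_B": [0.0, 1.0, 0.0]}
mystery = [0.9, 0.1, 0.0]  # stand-in for an embedded "mystery object"

match = best_match(mystery, targets, min_similarity=0.8)
```

Returning None when nothing clears the cutoff keeps weak matches from becoming links; in practice a vector store would handle the nearest-neighbor search, and the score it reports could be recorded alongside the embedding model in the link's qualifiers.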

The following query can be used to pull together relevant items for embedding as potential target objects in vector space. This approach simply uses the basic descriptive elements of a concept, which can include some indication of its relationships with other concepts. We're also experimenting with more graph-specific models that embed the full relationship structure of the GeoKB ontology.

PREFIX geokbe: <https://geokb.wikibase.cloud/entity/>
PREFIX geokbp: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?itemDescription ?classLabel
(CONCAT(?itemLabel, " - ", ?itemDescription, " (", COALESCE(?itemAltLabel, ""), ")") AS ?embedding_text)
WHERE {
  ?item geokbp:P2* ?class .
  VALUES ?class { geokbe:Q3 geokbe:Q158710 geokbe:Q159046 }
  FILTER(?item != geokbe:Q3)
  FILTER(?item != geokbe:Q158710)
  FILTER(?item != geokbe:Q159046)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}

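The embedding text built by the CONCAT expression above can also be assembled on the client side, which may be convenient when the same format needs to be applied to "label-only objects" before embedding them. The following is a sketch under that assumption; the function name is hypothetical, and unlike the query it drops the separator or parentheses when a description or alias is missing instead of leaving them empty.

```python
def embedding_text(label, description=None, aliases=None):
    """Build 'label - description (aliases)', omitting any missing pieces."""
    text = label
    if description:
        text += " - " + description
    if aliases:
        text += " (" + aliases + ")"
    return text


# Illustrative values only; they are not taken from the GeoKB.
full_text = embedding_text("gold", "chemical element", "Au")
label_only_text = embedding_text("gold")
```

Here full_text has the same shape as the query's embedding_text column, while label_only_text is just the bare label, so a mystery object and a target concept are embedded from comparably formatted strings.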