Item talk:Q44146
Reconciling USMIN Dataset to Create/Update Entries With OpenRefine
Purpose
Given that there are datasets related to USMIN on the ScienceBase website, we want to transfer related mine information into the GeoScience KnowledgeBase (GeoKB) to enrich the knowledge graph already available. By doing so, new information and insights can be gathered and processed as desired. We will be using a tool called OpenRefine, a GUI meant for reconciliation with many different Wikibase instances as well as upserting relevant data to the desired instance.
Prerequisites
This discussion assumes you have the following installed:
- Docker
Retrieving the USMIN dataset
For this example, we will be using the USGS_TopoMineSymbols_ver9_Geodatabase.zip downloadable file from the ScienceBase item above. Once unzipped, we can use the USGS_TopoMineSymbols_ver9.gdb folder.
As of this writing, version 10 of the dataset has been released, with USGS_TopoMineSymbols_ver10_Geodatabase.zip replacing the file used in this example. We will have to determine whether to update to the new version depending on whether it contains additional relevant data to insert into the GeoKB.
The desired format for processing data in OpenRefine is either a csv or xlsx file, so we will need to convert this gdb into csv format. We can do this using the OpenGIS Simple Features Reference Implementation (OGR), specifically the ogr2ogr command found in the Geospatial Data Abstraction Library (GDAL). Another tool available within GDAL is the ogrinfo command, which we can use to inspect the metadata of the gdb file that was just downloaded. This is helpful to get an understanding of the feature layers, fields, and property types.
Example command used for topominesymbols (24k) using ogrinfo:
docker run --rm -v $(pwd):/home ghcr.io/osgeo/gdal:alpine-normal-latest ogrinfo -ro -so -al /home/USGS_TopoMineSymbols_ver9.gdb
where -ro opens the data source in read-only mode, -so outputs the info in summary mode, and -al lists all the layers found within. More options can be found in the GDAL documentation. Because this is just for demonstration purposes, the layer that will be used is USGS_TopoMineSymbols_24k_Points, although there are larger layers with much more information that could be processed.
The command above will pull the GDAL docker image automatically if the latest version is not found locally.
Now that we know which layer needs to be converted, we can use ogr2ogr to convert from gdb to csv. This can be done using the following command in the terminal.
docker run --rm -v $(pwd):/home ghcr.io/osgeo/gdal:alpine-normal-latest ogr2ogr -f CSV -lco GEOMETRY=AS_XY /home/USGS_TopoMineSymbols_24k_Points.csv /home/USGS_TopoMineSymbols_ver9.gdb USGS_TopoMineSymbols_24k_Points
Notice that this command uses the layer creation option -lco GEOMETRY=AS_XY to write the point geometry as X (longitude) and Y (latitude) columns, which is a crucial step since without those coordinate columns the rest of the process would be much harder given the nature of the output csv data.
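As a quick sanity check before importing into OpenRefine, a minimal sketch like the following (plain Python run locally, not in OpenRefine; it assumes the file name produced by the ogr2ogr command above) can confirm that the X and Y coordinate columns made it into the csv:
import csv

# Read just the header row of the exported csv.
with open('USGS_TopoMineSymbols_24k_Points.csv', newline='') as f:
    header = next(csv.reader(f))

# The first two columns should be X (longitude) and Y (latitude),
# followed by the attribute fields from the geodatabase layer.
print(header[:5])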
Now that the csv file is ready, we can open the site https://openrefine.demo5280.com/ and create a new project. Once that is done, we can start the reconciliation and preprocessing needed to upsert any new and relevant information.
After evaluating the data, we wanted to accomplish these goals for each record:
- find an existing mine with a known location within 10 km of the location provided in the CSV
- associate the location to the existing mine
- associate additional feature properties
- associate Wikidata "same as" items for feature properties (i.e. Ftr_type)
- reference USMIN as the Knowledge Source for these claims
Matching Data to Closest Mines by Location
The first thing of note is that the ftr_name field does not have much information on what the mines are called or associated with. To solve this, we need to find the closest match to a mine that already exists within the GeoKB. Specifically, we use SPARQL and the built-in Wikibase functionality to determine the best candidate to match each record to. In OpenRefine, select the dropdown arrow of a column (it doesn't really matter which one), then select Edit column → Add column by fetching URLs. This will open a pop-up window where you can enter an expression or script to build a URL request. Because Python is more widely known than the default General Refine Expression Language (GREL), we first need to switch the language to Python/Jython.
Note: Some standard libraries are already available in the system, so we can use the following script without any additional setup to create a SPARQL request to the https://geokb.wikibase.cloud/query/sparql endpoint.
Python to paste into the textarea to retrieve the item closest to the specified coordinates:
from urllib import quote_plus

# Coordinates of the current record (the X/Y columns produced by ogr2ogr);
# cast to str in case OpenRefine parsed them as numbers.
x = str(cells['X'].value)
y = str(cells['Y'].value)

# Find the nearest existing mine (instance of Q3646) within 10 km of this point.
query = '''
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
SELECT ?item ?itemLabel ?location ?distance
WHERE {
  ?item wdt:P1 wd:Q3646 .
  SERVICE wikibase:around {
    ?item wdt:P6 ?location .
    bd:serviceParam wikibase:center "POINT(''' + x + ''' ''' + y + ''')"^^geo:wktLiteral .
    bd:serviceParam wikibase:radius "10" .
    bd:serviceParam wikibase:distance ?distance .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ASC(?distance)
LIMIT 1
'''

encoded_query = quote_plus(query)
return 'https://geokb.wikibase.cloud/query/sparql?query=' + encoded_query + '&format=json'
This will output the following or something similar for each entry of the csv.
{
"head" : {
"vars" : [ "item", "itemLabel", "location", "distance" ]
},
"results" : {
"bindings" : [ {
"item" : {
"type" : "uri",
"value" : "https://geokb.wikibase.cloud/entity/Q5640"
},
"location" : {
"datatype" : "http://www.opengis.net/ont/geosparql#wktLiteral",
"type" : "literal",
"value" : "Point(-87.1258288 33.1237301)"
},
"distance" : {
"datatype" : "http://www.w3.org/2001/XMLSchema#double",
"type" : "literal",
"value" : "2.739"
},
"itemLabel" : {
"xml:lang" : "en",
"type" : "literal",
"value" : "Hill Creek Mine (138378)"
}
} ]
}
}
To parse the response into new columns, we can use the following Python script to read the JSON and extract the relevant data. Select Edit column → Add column based on this column from the dropdown menu of the closest_item_response field that was just created, and add the following Python script in the textarea to create the closest_item_label column:
import json
val_dict = json.loads(value)
res = val_dict['results']['bindings'][0]['itemLabel']['value']
return res
Similarly, we can create the item, location, and distance columns by replacing the itemLabel key in the script above.
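For example, a sketch of the same script adjusted to create a closest_item_distance column (the column name here is just illustrative; the distance value is in kilometers) would look like this:
import json

# Same pattern as above, but pull the distance to the closest mine instead of its label.
val_dict = json.loads(value)
res = val_dict['results']['bindings'][0]['distance']['value']
return res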
Additional Preprocessing
One thing of note is that the data only has abbreviations of the states where the mines are located. This is not as useful as we need it to be because the corresponding items in the GeoKB are listed under their full names. Fortunately, there is an easy and efficient fix. We can create a new field state_name_full that returns the full name using Python and a dictionary (using the same Add column based on this column method as before).
state_name = { 'AL': 'Alabama',
'AK': 'Alaska',
'AZ': 'Arizona',
'AR': 'Arkansas',
'CA': 'California',
'CO': 'Colorado',
'CT': 'Connecticut',
'DE': 'Delaware',
'FL': 'Florida',
'GA': 'Georgia',
'HI': 'Hawaii',
'ID': 'Idaho',
'IL': 'Illinois',
'IN': 'Indiana',
'IA': 'Iowa',
'KS': 'Kansas',
'KY': 'Kentucky',
'LA': 'Louisiana',
'ME': 'Maine',
'MD': 'Maryland',
'MA': 'Massachusetts',
'MI': 'Michigan',
'MN': 'Minnesota',
'MS': 'Mississippi',
'MO': 'Missouri',
'MT': 'Montana',
'NE': 'Nebraska',
'NV': 'Nevada',
'NH': 'New Hampshire',
'NJ': 'New Jersey',
'NM': 'New Mexico',
'NY': 'New York',
'NC': 'North Carolina',
'ND': 'North Dakota',
'OH': 'Ohio',
'OK': 'Oklahoma',
'OR': 'Oregon',
'PA': 'Pennsylvania',
'RI': 'Rhode Island',
'SC': 'South Carolina',
'SD': 'South Dakota',
'TN': 'Tennessee',
'TX': 'Texas',
'UT': 'Utah',
'VT': 'Vermont',
'VA': 'Virginia',
'WA': 'Washington',
'WV': 'West Virginia',
'WI': 'Wisconsin',
'WY': 'Wyoming',
'DC': 'District of Columbia',
'AS': 'American Samoa',
'GU': 'Guam',
'MP': 'Northern Mariana Islands',
'PR': 'Puerto Rico',
'UM': 'United States Minor Outlying Islands',
'VI': 'U.S. Virgin Islands'
}
return state_name[cells['State'].value]
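If the data might contain blank cells or abbreviations that are missing from the dictionary, a slightly more defensive version of the final return line is an option; this is just a sketch and not required for the current data:
# Fall back to the original abbreviation instead of raising a KeyError
# when an unexpected State value is encountered.
abbrev = cells['State'].value
return state_name.get(abbrev, abbrev)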
According to the OpenRefine documentation for schema formatting, we will need to parse the returned WKT of the closest_item_loc column that was generated. There are two ways to solve this:
- parsing the coordinate value when it is retrieved from the response
- creating a new column with the coordinates specified in the USMIN dataset
To do the former, we can use the following script.
import json
import re

# Read the WKT point, e.g. "Point(-87.1258288 33.1237301)", and reorder it
# into the "lat,lon" format expected by the schema.
val_dict = json.loads(cells['closest_item_response'].value)
res = val_dict['results']['bindings'][0]['location']['value']
lon, lat = re.findall(r'-?\d+\.\d+', res)
return lat + ',' + lon
If we instead just want to create a new column from the X and Y columns, go to Edit column → Add column based on this column and add this Python script in the textarea.
# Cast to str in case OpenRefine parsed the coordinates as numbers.
return str(cells['Y'].value) + ',' + str(cells['X'].value)
Since we know where the data originated, we can create one more column named data_source with the following Python added into the textarea.
return 'https://www.sciencebase.gov/catalog/item/5a1492c3e4b09fc93dcfd574'
This will be used as the reference URL when building the schema (note that the exported schema below refers to the reference URL column as ref_url).
Reconcile the Data
Now that we have the parsed information on the closest mine added alongside each entry of the csv, we can reconcile the relevant information back to the GeoKB. For example, to reconcile the state name to the corresponding item in the GeoKB, we select Reconcile → Start reconciling. Under Services, select Reconcile for GeoKB (en), ensure the U.S. State (Q229) bullet is checked, then click the Start reconciling button at the bottom right of the pop-up window. This will find the corresponding item and match it to each entry in the field. Once it is finished processing, double-check that the items are matched correctly. If needed, select Search for match under the reconciled data and select the best fit. This will change the other similar rows in the dataset automatically unless the Match this cell only option is selected.
Other columns have been reconciled this way, e.g. the closest_item_label column created from the URL request was reconciled against the mine (Q3646) item.
A benefit of reconciling in OpenRefine is that different columns can be reconciled using different wiki services. This was done with Ftr_type to retrieve the corresponding QID values found in Wikidata. To do this, we select the dropdown for the reconciled Ftr_type column, then select Edit column → Add columns from reconciled values and click on Qid under Suggested properties.
If needed, the QID is also returned by the URL request within the item value as a URI. Using a regular expression, we can extract it with the following Python code to create the closest_QID column.
import re

# Pull the QID (e.g. "Q5640") out of the entity URI stored in the item column.
return re.search(r'Q\d+', str(value)).group()
Building the schema
Now that the data has been processed and reconciled, we can build a schema before submitting the changes to the GeoKB. This can be done using the Schema section of OpenRefine, located at the top middle of the GUI (if it doesn't show for some reason, another way to get there is to select Wikibase → Edit Wikibase schema under the Wikibase extension on the top right of the GUI). Here is the schema that was exported, which can be imported back in using Wikibase → Manage schemas → Choose file.
{
"entityEdits": [
{
"type": "wbitemeditexpr",
"subject": {
"type": "wbentityvariable",
"columnName": "closest_item_label"
},
"nameDescs": [
{
"name_type": "LABEL_IF_NEW",
"value": {
"type": "wbmonolingualexpr",
"language": {
"type": "wblanguageconstant",
"id": "en",
"label": "en"
},
"value": {
"type": "wbstringvariable",
"columnName": "closest_item_label"
}
}
}
],
"statementGroups": [
{
"property": {
"type": "wbpropconstant",
"pid": "P34",
"label": "U.S. state",
"datatype": "wikibase-item"
},
"statements": [
{
"value": {
"type": "wbentityvariable",
"columnName": "state_name_full"
},
"qualifiers": [],
"references": [
{
"snaks": [
{
"prop": {
"type": "wbpropconstant",
"pid": "P31",
"label": "reference URL",
"datatype": "url"
},
"value": {
"type": "wbstringvariable",
"columnName": "ref_url"
}
}
]
}
],
"mergingStrategy": {
"type": "snak",
"valueMatcher": {
"type": "lax"
}
},
"mode": "add_or_merge"
}
]
},
{
"property": {
"type": "wbpropconstant",
"pid": "P6",
"label": "coordinate location",
"datatype": "globe-coordinate"
},
"statements": [
{
"value": {
"type": "wblocationvariable",
"columnName": "closest_loc_parsed"
},
"qualifiers": [
{
"prop": {
"type": "wbpropconstant",
"pid": "P7",
"label": "publication date",
"datatype": "time"
},
"value": {
"type": "wbdatevariable",
"columnName": "Topo_Date"
}
},
{
"prop": {
"type": "wbpropconstant",
"pid": "P27",
"label": "data source",
"datatype": "wikibase-item"
},
"value": {
"type": "wbentityvariable",
"columnName": "data_source"
}
},
{
"prop": {
"type": "wbpropconstant",
"pid": "P117",
"label": "Development Status",
"datatype": "wikibase-item"
},
"value": {
"type": "wbentityvariable",
"columnName": "Remarks"
}
}
],
"references": [
{
"snaks": [
{
"prop": {
"type": "wbpropconstant",
"pid": "P31",
"label": "reference URL",
"datatype": "url"
},
"value": {
"type": "wbstringvariable",
"columnName": "ref_url"
}
}
]
}
],
"mergingStrategy": {
"type": "snak",
"valueMatcher": {
"type": "lax"
}
},
"mode": "add_or_merge"
}
]
},
{
"property": {
"type": "wbpropconstant",
"pid": "P35",
"label": "U.S. county",
"datatype": "wikibase-item"
},
"statements": [
{
"value": {
"type": "wbentityvariable",
"columnName": "County"
},
"qualifiers": [],
"references": [
{
"snaks": [
{
"prop": {
"type": "wbpropconstant",
"pid": "P31",
"label": "reference URL",
"datatype": "url"
},
"value": {
"type": "wbstringvariable",
"columnName": "ref_url"
}
}
]
}
],
"mergingStrategy": {
"type": "snak",
"valueMatcher": {
"type": "lax"
}
},
"mode": "add_or_merge"
}
]
},
{
"property": {
"type": "wbpropconstant",
"pid": "P1",
"label": "instance of",
"datatype": "wikibase-item"
},
"statements": [
{
"value": {
"type": "wbentityvariable",
"columnName": "Ftr_Type"
},
"qualifiers": [],
"references": [
{
"snaks": [
{
"prop": {
"type": "wbpropconstant",
"pid": "P31",
"label": "reference URL",
"datatype": "url"
},
"value": {
"type": "wbstringvariable",
"columnName": "ref_url"
}
}
]
}
],
"mergingStrategy": {
"type": "snak",
"valueMatcher": {
"type": "lax"
}
},
"mode": "add_or_merge"
}
]
}
]
}
],
"siteIri": "https://geokb.wikibase.cloud/entity/",
"entityTypeSiteIRI": {
"item": "https://geokb.wikibase.cloud/entity/",
"property": "https://geokb.wikibase.cloud/entity/",
"mediainfo": "https://geokb.wikibase.cloud/entity/"
},
"mediaWikiApiEndpoint": "https://geokb.wikibase.cloud/w/api.php"
}
Now that the schema is set the way we want, select Wikibase → Upload edits to Wikibase and log in if prompted. If warnings pop up, some of the values being inserted need to be adjusted to follow the correct format. Once all the warnings and errors are taken care of, add a summary of the upsert being made in the textbox, then click Upload edits to add the data into the GeoKB.
Some final steps need to be taken to ensure the schema is saved and you have logged out of your account. To save a schema, click the Save new button found in the Schema section of the GUI. From there you can either save it as a new schema by providing a name in the text box or overwrite an existing schema that was previously saved.
Warning: Simply clicking the Save schema button on the top right does not save your changes unless you loaded a previously saved schema; a schema created from scratch will not be saved and will be lost if overwritten.
To log out, go to Wikibase in the extensions section on the top right and select Manage Wikibase account. From there, click Log out in the pop-up window.