Item talk:Q44146

From geokb
Revision as of 20:41, 20 June 2023 by Joliveros (talk | contribs) (→‎Additional Preprocessing: removed unnecessary step)

Reconciling USMIN Dataset to Create/Update Entries With OpenRefine

Purpose

Given that there are datasets related to USMIN within the sciencebase website, we want to transfer related mine information into the GeoScience KnowlegdeBase to add more related information into the knowledge graph already availailable. By doing so, new information and insights can be gathered and processed as desired. We will be using a tool called OpenRefine, a GUI meant for reconciliation with many different wikibase instances as well as upserting any relevant data to the desired instance.

Prerequisites

This discussion is assuming you have the following installed

  • Docker

Retrieving the USMIN dataset

For this example, we will be using the USGS_TopoMineSymbols_ver9_Geodatabase.zip downloadable file from the sciencebase item above. Once unzipped we can use the USGS_TopoMineSymbols_ver9.gdb folder.

As of this writing version 10 of the dataset has been released, with USGS_TopoMineSymbols_ver10_Geodatabase.zip being the replacement of the example that was used. We will have to determine whether we need to update the version depending on if there is more relevant data that we need to insert into the geoKB.

The desired format for processing data using OpenRefine is either a csv or xlsx file, so because of that we will need to convert this gdb into csv format. We can do this using the OpenGIS Simple Features Reference Implementation (OGR), specifically ogr2ogr command found in the Geospatial Data Abstraction Library (GDAL). One feature available within is the ogrinfo command, which we can use to find out the metadata of the gdb file that was just downloaded. This is helpful to get an understanding of the feature layers, fields and property types.

Example command used for topominesymbols (24k) using ogrinfo:

docker run --rm -v $(pwd):/home ghcr.io/osgeo/gdal:alpine-normal-latest ogrinfo -ro -so -al /home/USGS_TopoMineSymbols_ver9.gdb

where -ro opens the data source in read-only mode, -so outputs the info in summary mode, and -al lists all the layers found within. More options can be found int the GDAL documentation. Because this is just for demonstration purposes the layer that will be used is USGS_TopoMineSymbols_24k_Points, although there are larger layers with much more information that can be used for processing.

The command above will pull the GDAL docker image automatically if the latest version is not found locally.

Now that we know which layer needs to be converted, we can used ogr2ogr to convert from gdb to csv. This can be done using the following command in the terminal.

 docker run --rm -v $(pwd):/home ghcr.io/osgeo/gdal:alpine-normal-latest ogr2ogr -f CSV -lco GEOMETRY=AS_XY /home/USGS_TopoMineSymbols_24k_Points.csv /home/USGS_TopoMineSymbols_ver9.gdb USGS_TopoMineSymbols_24k_Points

Notice that this command specifies that we want the lat and long coordinates (-lco), which is a crucial step since without them it will make the process much harder due to the nature of the outputted csv data.

Now that the csv file is ready, we can open the site https://openrefine.demo5280.com/ and create a new project. Once that is done we can start the reconciliation and preprocessing needed to upsert any new and relevant information.

After evaluating the data we wanted to accomplish these goals: - For each record - find an existing mine within 10km of the provided location in the CSV with a known location - associate the location to the existing mine - associate additional feature properties - associating wikidata same as items for feature properties (ie. Ftr_type) - reference USMIN as the Knowledge Source to these claims

Matching Data to Closest Mines by Location

First thing of note is that the ftr_name field does not have much info on what the mines are called or associated to. To solve this issue, we need to find the closest match to a mine that already exists within the GeoKB. Specifically, we use SPARQL and the built-in wikibase functionalities to determine the best candidates to match the data to. In OpenRefine, select the dropdown arrow of a column (doesn’t really matter which one), then select Edit columnAdd column by fetching URLs. This will give a pop up window where you can enter a command or script to make a URL request. Because Python is more well known than the standard selection of General Refine Expression Language(GREL), we first need to switch the language to Python/Jython.

Note: There are some standard libraries already added in to the system so we can use the following script without any additional work to create a SPARQL request to the https://geokb.wikibase.cloud/query/sparql endpoint.

Python to paste into the textarea to retrieve the item closest to specified coordinate:

from urllib import quote_plus
x=cells['X'].value
y=cells['Y'].value
query='''
PREFIX wd: <https://geokb.wikibase.cloud/entity/>
PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>

SELECT ?item ?itemLabel ?location ?distance
WHERE {
  ?item wdt:P1 wd:Q3646 .
  SERVICE wikibase:around { 
      ?item wdt:P6 ?location . 
      bd:serviceParam wikibase:center "POINT('''+x+''' '''+y+''')"^^geo:wktLiteral . 
      bd:serviceParam wikibase:radius "10" . 
      bd:serviceParam wikibase:distance ?distance.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
ORDER BY ASC(?distance)
LIMIT 1
'''
encoded_query=quote_plus(query)
return 'https://geokb.wikibase.cloud/query/sparql?query='+encoded_query+'&format=json'

This will output the following or something similar for each entry of the csv.

{
  "head" : {
    "vars" : [ "item", "itemLabel", "location", "distance" ]
  },
  "results" : {
    "bindings" : [ {
      "item" : {
        "type" : "uri",
        "value" : "https://geokb.wikibase.cloud/entity/Q5640"
      },
      "location" : {
        "datatype" : "http://www.opengis.net/ont/geosparql#wktLiteral",
        "type" : "literal",
        "value" : "Point(-87.1258288 33.1237301)"
      },
      "distance" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#double",
        "type" : "literal",
        "value" : "2.739"
      },
      "itemLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "Hill Creek Mine (138378)"
      }
    } ]
  }
}

To parse the response given into new columns, we can use the following Python script to read the json and extract the relevant data. Select Edit columnAdd column based on this column found in the dropdown menu for the closest_item_response field that just created and add the following Python script in the textarea to create closest_item_label column:

import json
val_dict = json.loads(value)
res = val_dict['results']['bindings'][0]['itemLabel']['value']
return res

Similarly, we can create the item, location, and distance columns by replacing the itemLabel variable found in the script above.

Additional Preprocessing

One thing of note is that the data only has the abbreviation of the states where the mines are located. This is not as useful as we need it to be due to the fact that the items found in the GeoKB are listed as their respective full names. Fortunately there is an easy and efficient fix to this. We can create a new field state_name_full that returns the full name using Python and a dictionary (we can use the same method to Add the columns based on other columns done previously).

state_name = { 'AL': 'Alabama',
  'AK': 'Alaska',
  'AZ': 'Arizona',
  'AR': 'Arkansas',
  'CA': 'California',
  'CO': 'Colorado',
  'CT': 'Connecticut',
  'DE': 'Delaware',
  'FL': 'Florida',
  'GA': 'Georgia',
  'HI': 'Hawaii',
  'ID': 'Idaho',
  'IL': 'Illinois',
  'IN': 'Indiana',
  'IA': 'Iowa',
  'KS': 'Kansas',
  'KY': 'Kentucky',
  'LA': 'Louisiana',
  'ME': 'Maine',
  'MD': 'Maryland',
  'MA': 'Massachusetts',
  'MI': 'Michigan',
  'MN': 'Minnesota',
  'MS': 'Mississippi',
  'MO': 'Missouri',
  'MT': 'Montana',
  'NE': 'Nebraska',
  'NV': 'Nevada',
  'NH': 'New Hampshire',
  'NJ': 'New Jersey',
  'NM': 'New Mexico',
  'NY': 'New York',
  'NC': 'North Carolina',
  'ND': 'North Dakota',
  'OH': 'Ohio',
  'OK': 'Oklahoma',
  'OR': 'Oregon',
  'PA': 'Pennsylvania',
  'RI': 'Rhode Island',
  'SC': 'South Carolina',
  'SD': 'South Dakota',
  'TN': 'Tennessee',
  'TX': 'Texas',
  'UT': 'Utah',
  'VT': 'Vermont',
  'VA': 'Virginia',
  'WA': 'Washington',
  'WV': 'West Virginia',
  'WI': 'Wisconsin',
  'WY': 'Wyoming',
  'DC': 'District of Columbia',
  'AS': 'American Samoa',
  'GU': 'Guam',
  'MP': 'Northern Mariana Islands',
  'PR': 'Puerto Rico',
  'UM': 'United States Minor Outlying Islands',
  'VI': 'U.S. Virgin Islands' 
}
return state_name[cells['State'].value]

According to the OpenRefine documentation for schema formatting, we will need to parse the returned WKT of the closest_item_loc column that was generated. There’s two ways to solve this: - parsing the coordinate value when it is retrieved from the response - creating a new column with the coordinates specified in the USMIN dataset

To do the former, we can use the following script.

import json
import re
val_dict = json.loads(cells['closest_item_response'].value)
res = val_dict['results']['bindings'][0]['location']['value']
lon,lat = re.findall('-?\d+\.\d+',res)
return lat+','+lon

If we just want to create a new column from the X and Y columns just go to Edit columnAdd columns based on this column and add this Python script in the textarea.

return cells['Y'].value + ',' + cells['X'].value


Reconcile the Data

Now that we have the parsed information on the closest mine and added it alongside each data entry of the csv, we can reconcile the relevant information back to the GeoKB. For example, to reconcile the state name to the corresponding item found in the GeoKB, we select ReconcileStart reconciling. Under Services, Select Reconcile for GeoKB (en) and ensure that U.S. State (Q229) bullet is checked, then click on the Start reconciling button found at the bottom right of the pop-up window. This will find the corresponding item and match them to each entry in the field. Once it is finished processing, double check to make sure that the corresponding items are matched correctly. If needed, select Search for match under the reconciled data and select the best fit. This will change the other similar rows in the dataset automatically unless the Match this cell only option is selected.

Other columns have been reconciled this way e.g. closest_item_label column that was created from the URL request against the mine(Q3646) item.

The benefit of reconciling in OpenRefine is that different columns can be reconciled using different Wiki services. This was actually done with Ftr_type to retrieve the QId values found in Wikidata. To do this we select the dropdown for the reconciled Ftr_type column then selecting Edit columnAdd columns from reconciled values and clicking on Qid found in the Suggested properties textarea.

If needed, the QId was returned in the URL request within the item value as a URI. Using regular expressions we can extract the value within using the following Python code to create closest_QID column.

import re
return re.search('Q\d+', str(value)).group()

Building the schema

Now that the data has been processed and reconciled, we can build a schema before submitting the changes to the GeoKB. This can be done using the schema section of OpenRefine, which is located on the top middle of the GUI (If it doesn’t show for some reason, another way to get there is to select WikibaseEdit Wikibase schema found under the Wikibase extension on the top right of the GUI). Here is the schema that was exported that can be imported back in using WikibaseManage schemasChoose file.

{
    "name": "USMIN schema",
    "schema": {
        "entityEdits": [
            {
                "type": "wbitemeditexpr",
                "subject": {
                    "type": "wbentityvariable",
                    "columnName": "closest_item_label"
                },
                "statementGroups": [
                    {
                        "property": {
                            "type": "wbpropconstant",
                            "pid": "P11",
                            "label": "located in the administrative territorial entity",
                            "datatype": "wikibase-item"
                        },
                        "statements": [
                            {
                                "value": {
                                    "type": "wbentityvariable",
                                    "columnName": "state_name_full"
                                },
                                "qualifiers": [],
                                "references": [
                                    {
                                        "snaks": [
                                            {
                                                "prop": {
                                                    "type": "wbpropconstant",
                                                    "pid": "P70",
                                                    "label": "knowledge source",
                                                    "datatype": "wikibase-item"
                                                },
                                                "value": {
                                                    "type": "wbentityvariable",
                                                    "columnName": "kb_source"
                                                }
                                            }
                                        ]
                                    }
                                ],
                                "mode": "add_or_merge",
                                "mergingStrategy": {
                                    "type": "snak",
                                    "valueMatcher": {
                                        "type": "lax"
                                    }
                                }
                            },
                            {
                                "value": {
                                    "type": "wbentityvariable",
                                    "columnName": "County"
                                },
                                "qualifiers": [],
                                "references": [
                                    {
                                        "snaks": [
                                            {
                                                "prop": {
                                                    "type": "wbpropconstant",
                                                    "pid": "P70",
                                                    "label": "knowledge source",
                                                    "datatype": "wikibase-item"
                                                },
                                                "value": {
                                                    "type": "wbentityvariable",
                                                    "columnName": "kb_source"
                                                }
                                            }
                                        ]
                                    }
                                ],
                                "mode": "add_or_merge",
                                "mergingStrategy": {
                                    "type": "snak",
                                    "valueMatcher": {
                                        "type": "lax"
                                    }
                                }
                            }
                        ]
                    },
                    {
                        "property": {
                            "type": "wbpropconstant",
                            "pid": "P6",
                            "label": "coordinate location",
                            "datatype": "globe-coordinate"
                        },
                        "statements": [
                            {
                                "value": {
                                    "type": "wblocationvariable",
                                    "columnName": "lat_lon_coords"
                                },
                                "qualifiers": [
                                    {
                                        "prop": {
                                            "type": "wbpropconstant",
                                            "pid": "P7",
                                            "label": "publication date",
                                            "datatype": "time"
                                        },
                                        "value": {
                                            "type": "wbdatevariable",
                                            "columnName": "Topo_Date"
                                        }
                                    },
                                    {
                                        "prop": {
                                            "type": "wbpropconstant",
                                            "pid": "P120",
                                            "label": "Remarks",
                                            "datatype": "string"
                                        },
                                        "value": {
                                            "type": "wbstringvariable",
                                            "columnName": "Remarks"
                                        }
                                    }
                                ],
                                "references": [
                                    {
                                        "snaks": [
                                            {
                                                "prop": {
                                                    "type": "wbpropconstant",
                                                    "pid": "P70",
                                                    "label": "knowledge source",
                                                    "datatype": "wikibase-item"
                                                },
                                                "value": {
                                                    "type": "wbentityvariable",
                                                    "columnName": "kb_source"
                                                }
                                            }
                                        ]
                                    }
                                ],
                                "mode": "add_or_merge",
                                "mergingStrategy": {
                                    "type": "snak",
                                    "valueMatcher": {
                                        "type": "lax"
                                    }
                                }
                            }
                        ]
                    },
                    {
                        "property": {
                            "type": "wbpropconstant",
                            "pid": "P118",
                            "label": "GDA ID",
                            "datatype": "external-id"
                        },
                        "statements": [
                            {
                                "value": {
                                    "type": "wbstringvariable",
                                    "columnName": "GDA_ID"
                                },
                                "qualifiers": [],
                                "references": [
                                    {
                                        "snaks": [
                                            {
                                                "prop": {
                                                    "type": "wbpropconstant",
                                                    "pid": "P70",
                                                    "label": "knowledge source",
                                                    "datatype": "wikibase-item"
                                                },
                                                "value": {
                                                    "type": "wbentityvariable",
                                                    "columnName": "kb_source"
                                                }
                                            }
                                        ]
                                    }
                                ],
                                "mode": "add_or_merge",
                                "mergingStrategy": {
                                    "type": "snak",
                                    "valueMatcher": {
                                        "type": "lax"
                                    }
                                }
                            }
                        ]
                    },
                    {
                        "property": {
                            "type": "wbpropconstant",
                            "pid": "P119",
                            "label": "Scan ID",
                            "datatype": "external-id"
                        },
                        "statements": [
                            {
                                "value": {
                                    "type": "wbstringvariable",
                                    "columnName": "ScanID"
                                },
                                "qualifiers": [],
                                "references": [
                                    {
                                        "snaks": [
                                            {
                                                "prop": {
                                                    "type": "wbpropconstant",
                                                    "pid": "P70",
                                                    "label": "knowledge source",
                                                    "datatype": "wikibase-item"
                                                },
                                                "value": {
                                                    "type": "wbentityvariable",
                                                    "columnName": "kb_source"
                                                }
                                            }
                                        ]
                                    }
                                ],
                                "mode": "add_or_merge",
                                "mergingStrategy": {
                                    "type": "snak",
                                    "valueMatcher": {
                                        "type": "lax"
                                    }
                                }
                            }
                        ]
                    },
                    {
                        "property": {
                            "type": "wbpropconstant",
                            "pid": "P1",
                            "label": "instance of",
                            "datatype": "wikibase-item"
                        },
                        "statements": [
                            {
                                "value": {
                                    "type": "wbentityvariable",
                                    "columnName": "Ftr_Type"
                                },
                                "qualifiers": [],
                                "references": [
                                    {
                                        "snaks": [
                                            {
                                                "prop": {
                                                    "type": "wbpropconstant",
                                                    "pid": "P70",
                                                    "label": "knowledge source",
                                                    "datatype": "wikibase-item"
                                                },
                                                "value": {
                                                    "type": "wbentityvariable",
                                                    "columnName": "kb_source"
                                                }
                                            }
                                        ]
                                    }
                                ],
                                "mode": "add_or_merge",
                                "mergingStrategy": {
                                    "type": "snak",
                                    "valueMatcher": {
                                        "type": "lax"
                                    }
                                }
                            }
                        ]
                    }
                ],
                "nameDescs": [
                    {
                        "type": "wbnamedescexpr",
                        "name_type": "LABEL_IF_NEW",
                        "value": {
                            "type": "wbmonolingualexpr",
                            "language": {
                                "type": "wblanguageconstant",
                                "id": "en",
                                "label": "en"
                            },
                            "value": {
                                "type": "wbstringvariable",
                                "columnName": "closest_item_label"
                            }
                        }
                    }
                ]
            }
        ],
        "siteIri": "https://geokb.wikibase.cloud/entity/",
        "entityTypeSiteIRI": {
            "item": "https://geokb.wikibase.cloud/entity/",
            "property": "https://geokb.wikibase.cloud/entity/",
            "mediainfo": "https://geokb.wikibase.cloud/entity/"
        },
        "mediaWikiApiEndpoint": "https://geokb.wikibase.cloud/w/api.php"
    }
}

Now that the schema is set the way we want, select WikibaseUpload edits to Wikibase. Log in if prompted. If there are warnings that pop up, then this means some adjustments to the values being inserted need to be modified to follow the correct format. Once all the warnings and errors are taken care of, add a summary in the textbox of the upsert being made, then click on Upload edits to add the data into the GeoKB.

Some final steps need to be taken to ensure the schema is saved and you have logged out of your account. To save a schema, click on the Save new button found in the schema section of the GUI. From there you can either save as a new schema by providing a name in the text box or we can overwrite an existing schema that was previously saved.

Warning: Simply clicking the Save schema button on the top right does not save your changes that have been made unless you loaded a previously saved schema. Any schema created from scratch will not be saved if overwritten.

To log out, go to Wikibase found in the extensions section on the top right, and select Manage Wikibase account. From there you can click the Log out on the pop up window.