Item talk:Q158190: Difference between revisions

(Created page with "<span id="usmin-mining-districts-found-in-the-polygonal-features"></span> = USMIN Mining Districts found in the polygonal features = <span id="purpose"></span> == Purpose == This started with the need to correlate mining districts and relate them back to any USMIN mine data found in the <code>Prospect- and Mine-Related Features from U.S. Geological Survey 7.5- and 15-Minute Topographic Quadrangle Maps of the United States (ver. 9.0, January 2023)</code>. To do this, va...")
 
<syntaxhighlight lang="python">from shapely import wkt
mining_districts['centroid'] = [wkt.loads(str(x)).centroid for x in mining_districts['geometry']]</syntaxhighlight>
According to the [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.datatypes.globecoordinate.html wbi GlobeCoordinate docs], the latitude and longitude must be provided as separate values. They can be read from the <code>x</code> (longitude) and <code>y</code> (latitude) attributes of each newly created centroid point:
<syntaxhighlight lang="python">mining_districts['centroid_lon'] = [row.x for row in mining_districts['centroid']]
mining_districts['centroid_lat'] = [row.y for row in mining_districts['centroid']]</syntaxhighlight>
From there, we set the desired label and aliases, if possible. The goal is to make sure that every label has ‘Mining Districts’ in the title to differentiate any other possible similar items within the GeoKB. We use the following to do that:


mining_districts['state'] = [x[0] for x in county_state]
mining_districts['county'] = [x[1] for x in county_state]</syntaxhighlight>
WikibaseIntegrator expects QIDs whenever the property datatype is <code>wikibase-item</code>. Originally I wasn’t sure whether the counties had been entered into the GeoKB, so I took two approaches to check whether any additional work was needed before looking up the county QIDs:
* I took a name from the <code>county</code> column (it didn’t matter which one) and pasted it into the search bar of the GUI. This usually brings up the relevant item since the search uses ElasticSearch. If too many results were returned, they could be narrowed down with the <code>Search In</code> dropdown menu found [https://geokb.wikibase.cloud/w/index.php?search=&search=&title=Special%3ASearch&go=Go here] by setting it to find only Items.
* I pulled up the <code>U.S. county</code> [https://geokb.wikibase.cloud/wiki/Item:Q481 Item page] and selected <code>What links here</code> on the right-hand side. This showed that all the counties have been added, but with titles more descriptive than just the county name, so that counties sharing a name are not confused with one another.
Now that we know the counties all exist, we need to find the corresponding ID for each title of interest. To do this, a general item search function was created:
<syntaxhighlight lang="python">import os
from typing import Optional

import requests


def item_search(label: str, instance_of: str, bot_name: str) -> Optional[str]:
  # The SPARQL endpoint is read from an environment variable keyed by bot name
  sparql_endpoint = os.environ[f'WB_SPARQL_{bot_name}']
  # Case-insensitive substring match on the label, restricted to items
  # whose instance-of (P1) value is the given QID
  query = f'''PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
  PREFIX wd:  <https://geokb.wikibase.cloud/entity/>
  SELECT ?item
  WHERE {{
    ?item rdfs:label ?label ;
      wdt:P1 wd:{instance_of} .
    FILTER CONTAINS( LCASE(?label), "{label.lower()}") .
    SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
  }}
  '''
  params = {
      'query': query,
      'format': 'json'
  }
  res = requests.get(sparql_endpoint, params=params, timeout=100)
  json_res = res.json()
  # Take the first binding, if there is one; its value is the full entity URI
  item_result = (json_res['results']['bindings'][0]['item']['value']
                  if 'results' in json_res
                  and len(json_res['results']['bindings']) > 0
                  and 'item' in json_res['results']['bindings'][0]
                  else None)
  # Keep only the trailing QID of the URI, or return None when nothing matched
  return item_result.split('/')[-1] if item_result is not None else None</syntaxhighlight>
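The response handling at the end of <code>item_search</code> can be checked without a network call. The following is a minimal sketch, not the code used for the upload: <code>qid_from_bindings</code> is a hypothetical helper name, and the sample dictionaries are hand-written in the standard SPARQL JSON results layout rather than copied from a real response.

```python
from typing import Optional


def qid_from_bindings(json_res: dict) -> Optional[str]:
    """Return the QID of the first ?item binding, or None if there is none."""
    bindings = json_res.get('results', {}).get('bindings', [])
    if not bindings or 'item' not in bindings[0]:
        return None
    # The value is a full entity URI; the QID is its last path segment
    return bindings[0]['item']['value'].split('/')[-1]


# Hand-written samples (assumed layout, not real query output)
sample = {'results': {'bindings': [
    {'item': {'type': 'uri',
              'value': 'https://geokb.wikibase.cloud/entity/Q481'}}
]}}
empty = {'results': {'bindings': []}}

print(qid_from_bindings(sample))  # Q481
print(qid_from_bindings(empty))   # None
```

Only the first binding is used, so when several labels contain the search text, whichever item the endpoint lists first is the one returned.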
The <code>item_search</code> function sends a SPARQL request for an item whose label contains the given text and whose instance-of value is the given QID. If a match is found, the first returned URI is stripped down to just the QID; otherwise the function returns <code>None</code>. The bot name passed in is the same name that is expected whenever the <code>WikibaseConnection</code> class is instantiated. For example, to find the county QIDs (given that the QID of the <code>U.S. county</code> item is <code>Q481</code>), we add a column using a list comprehension:
<syntaxhighlight lang="python">county_instance_of = 'Q481'
county_items = [
    item_search(f'{county}, {state}', county_instance_of, name)
    for (county,state)
    in zip(mining_districts['county'], mining_districts['state'])
]
mining_districts['county_qid'] = county_items</syntaxhighlight>
Notice that this zips the county and state columns elementwise so that the full county name matches the title used within the GeoKB. One entry in the DataFrame (the one with the feature name St. Louis County) cannot be matched as is because of the abbreviation. To fix that, we run a regex replacement on the column, passing the pattern as a raw string with the <code>.</code> escaped so that it matches a literal period:
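<syntaxhighlight lang="python"># Assign the result back; inplace=True on a selected column may not update the DataFrame
mining_districts['county'] = mining_districts['county'].replace(r'St\.', 'Saint', regex=True)</syntaxhighlight>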
Rerunning the county lookup above now returns the full set of QID matches.
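To see why the period is escaped, here is a small sketch using the standard <code>re</code> module instead of pandas. The two county names are illustrative inputs only, and the second is not claimed to be in the dataset.

```python
import re

# Illustrative inputs only
names = ['St. Louis County', 'Stanislaus County']

# Escaped dot: matches the literal abbreviation "St." only
escaped = [re.sub(r'St\.', 'Saint', n) for n in names]

# Unescaped dot: "." matches any character, so "Sta" is replaced as well
unescaped = [re.sub(r'St.', 'Saint', n) for n in names]

print(escaped)    # ['Saint Louis County', 'Stanislaus County']
print(unescaped)  # ['Saint Louis County', 'Saintnislaus County']
```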
A similar approach was taken to get the QIDs of the states: the QID of the <code>U.S. State</code> item is used as the instance-of value and passed into the <code>item_search</code> function:
<syntaxhighlight lang="python">state_instance_of = 'Q229'
state_items = [ item_search(state, state_instance_of, name) for state in mining_districts['state'] ]
mining_districts['state_qid'] = state_items</syntaxhighlight>
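Because many districts share the same state, the comprehension above sends the same query repeatedly. As an optional sketch (not part of the original workflow), the lookups can be cached so that each distinct label is searched only once. <code>lookup_unique</code> and <code>fake_search</code> are hypothetical names, and the returned identifiers are placeholders rather than real GeoKB QIDs.

```python
from typing import Callable, Dict, Iterable, List, Optional


def lookup_unique(labels: Iterable[str],
                  search: Callable[[str], Optional[str]]) -> List[Optional[str]]:
    """Call `search` once per distinct label and return results in input order."""
    cache: Dict[str, Optional[str]] = {}
    results = []
    for label in labels:
        if label not in cache:
            cache[label] = search(label)
        results.append(cache[label])
    return results


# Stand-in for item_search so the sketch runs offline; values are placeholders
calls = []

def fake_search(label: str) -> Optional[str]:
    calls.append(label)
    return {'Nevada': 'QID_A', 'Utah': 'QID_B'}.get(label)


qids = lookup_unique(['Nevada', 'Utah', 'Nevada', 'Ohio'], fake_search)
print(qids)   # ['QID_A', 'QID_B', 'QID_A', None]
print(calls)  # ['Nevada', 'Utah', 'Ohio']
```

With the real function, the second argument could be something like <code>lambda s: item_search(s, state_instance_of, name)</code>.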
One last thing that needs fixing is the dates in the <code>last_updt</code> column of <code>mining_districts</code>. One entry is not in the correct format (YYYY-M-DD instead of YYYY-MM-DD). Although this could be fixed manually, it is better to handle it in code in case other data points end up the same way. A <code>fix_time</code> function was created and the replacement is made as follows:


new_time = [fix_time(x, '%Y-%m-%d') if not re.search(r'\d\d\d\d-\d\d-\d\d', x) else x for x in mining_districts['last_updt']]
mining_districts['last_updt'] = new_time</syntaxhighlight>
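The body of <code>fix_time</code> is not shown in this part of the page. As a rough sketch of what such a normalization could look like, the hypothetical <code>normalize_date</code> below parses the value with <code>datetime.strptime</code>, which accepts a one-digit month, and writes it back zero-padded. The sample dates are made up for illustration.

```python
import re
from datetime import datetime

# Full-string match, so partial matches inside longer strings do not count
ISO_DATE = re.compile(r'\d{4}-\d{2}-\d{2}')


def normalize_date(value: str, fmt: str = '%Y-%m-%d') -> str:
    """Return `value` zero-padded as YYYY-MM-DD; well-formed values pass through."""
    if ISO_DATE.fullmatch(value):
        return value
    # strptime tolerates a missing leading zero; strftime always pads
    return datetime.strptime(value, fmt).strftime(fmt)


dates = ['2021-03-15', '2019-7-04']  # made-up examples
print([normalize_date(d) for d in dates])  # ['2021-03-15', '2019-07-04']
```

A value that does not fit the format at all raises <code>ValueError</code>, which makes unexpected entries visible instead of passing them through silently.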
<span id="wbmaker-setup"></span>
<span id="inserting-into-the-geokb"></span>
== Inserting Into the GeoKB ==
 
The following WikibaseIntegrator docs were used as references:
 
* [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.datatypes.html# Datatypes]
* [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.datatypes.item.html Item]
* [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.datatypes.externalid.html ExternalID]
* [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.datatypes.globecoordinate.html GlobeCoordinate]
* [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.datatypes.time.html Time]
* [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.datatypes.url.html URL]
 
Based on these, each item was created with the appropriate parameters for its label, description, and claims.
 
The property IDs needed for each claim were found within the bot, under the <code>prop_lookup</code> dictionary that is automatically stored in the class.
 
Each action was declared beforehand so that it is easy to adjust during item creation as needed. The docs for all the available options are found [https://wikibaseintegrator.readthedocs.io/en/stable/wikibaseintegrator.wbi_enums.html#wikibaseintegrator.wbi_enums.ActionIfExists here].