Item talk:Q158190


USMIN Mining Districts found in the polygonal features

Purpose

This started with the need to correlate mining districts and relate them back to any USMIN mine data found in the Prospect- and Mine-Related Features from U.S. Geological Survey 7.5- and 15-Minute Topographic Quadrangle Maps of the United States (ver. 9.0, January 2023). Various sources were evaluated, and the records provided by the USMIN Mineral Deposit Database were determined to be the best choice because they include a polygons layer with the mineral districts and their representative polygonal shapes.

Retrieval of Data

Using the WFS Version 1.1.0 URL endpoint found on the website, along with the Capabilities document that lists the metadata and possible parameters for each layer, the polygons layer was pulled into a Jupyter Notebook so that it could be turned into a DataFrame and preprocessed into a format that is easy for the wbmaker to read. To do this, we use the Fiona Python library to pull in the data. One thing to note is that reading a WFS is not enabled in Fiona by default and needs to be turned on, which can be done with the following:

fiona.drvsupport.supported_drivers['WFS'] = 'r'

From here we can use Fiona's BytesCollection class to retrieve the features. Here is the full function to read the WFS:

import requests
import geopandas as gpd
from fiona import BytesCollection

def read_wfs(url, params):
    # Stream the WFS response into Fiona and build a GeoDataFrame from its features
    with BytesCollection(requests.get(url, params=params).content) as f:
        df = gpd.GeoDataFrame.from_features(f, crs='EPSG:4326')
    return df

The parameters were determined from the Capabilities page for the dataset and from the OGC documentation. Using those, we can set the parameters as:

params = {
    'request':'GetFeature',
    'service':'WFS',
    'version':'1.1.0',
    'typeName':'polygons',
    'srsName': 'urn:ogc:def:crs:EPSG::4326'
}

This returns the desired polygons layer within the WFS.
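
Putting the pieces together, a minimal call might look like the following; the endpoint URL below is only a placeholder, with the real one being the WFS Version 1.1.0 URL from the USMIN website:

wfs_url = 'https://example.usgs.gov/services/wfs'  # placeholder; use the endpoint listed on the USMIN site
mining_districts = read_wfs(wfs_url, params)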

Processing and Cleaning

Once the data was pulled into the notebook, a few things needed to be done before the records could be loaded into the GeoKB. First, the Polygons in the geometry column were incorrect in that the coordinates were in Y,X order rather than the desired X,Y order. Looking into existing resources to fix this didn't help much, as there wasn't a direct way to swap the coordinates that worked properly (e.g. Shapely's reverse() function on the Polygon object returned an identical output, since it reverses the order of the vertices rather than swapping the axes), so after some thought this helper function was created:

from shapely.geometry import Polygon

def reverse_coords(geom: Polygon) -> Polygon:
    # Swap X and Y for every vertex of the exterior ring
    coords = geom.exterior.coords.xy
    swapped = tuple(zip(coords[1], coords[0]))
    return Polygon(swapped)

Computationally this is not efficient, since it creates a whole new Polygon object to replace the old one, but with fewer than 1,000 entries in the dataset it counts as "good enough". From here we call this function in a list comprehension using

[reverse_coords(x) for x in mining_districts['geometry']]

and assign the result back to the geometry column to fix the issue.
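
In full, the reassignment (with mining_districts being the GeoDataFrame returned by read_wfs) is simply:

mining_districts['geometry'] = [reverse_coords(x) for x in mining_districts['geometry']]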

Another issue found was that there are duplicate entries within the dataset. Using pandas, we can fix this quite easily (given that mining_districts is the DataFrame we're working with):

mining_districts = mining_districts.drop_duplicates(subset=['site_id'])

Another claim of interest is the centroid coordinate of the polygon geometry. After some testing of various methods within the Shapely library, taking the centroid after round-tripping the geometry through WKT produced the most accurate result. Therefore, we get the centroid using the following:

from shapely import wkt
mining_districts['centroid'] = [wkt.loads(str(x)).centroid for x in mining_districts['geometry']]

According to the wbi GlobeCoordinate docs, the coordinate needs to be provided as separate latitude and longitude values. These can be extracted by accessing the x and y properties of the newly created centroid geometry:

mining_districts['centroid_lon'] = [row.x for row in mining_districts['centroid']]
mining_districts['centroid_lat'] = [row.y for row in mining_districts['centroid']]
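
As a minimal sketch of where these values end up (the property ID and the row variable below are placeholders for illustration, not the exact names used in the load), the two columns feed straight into a GlobeCoordinate claim:

from wikibaseintegrator.datatypes import GlobeCoordinate

coord_claim = GlobeCoordinate(
    prop_nr='P...',                 # placeholder PID for the coordinate location property
    latitude=row['centroid_lat'],   # row: one record from mining_districts
    longitude=row['centroid_lon'],
    precision=0.0001
)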

From there, we set the desired label and aliases where possible. The goal is to make sure that every label has 'Mining District' in the title to differentiate these items from any other similar items within the GeoKB. We use the following to do that:

def fix_label(label: str) -> str:
    if 'Mining District' in label:
        return label
    if 'District' in label:
        # e.g. "Foo District" becomes "Foo Mining District"
        return label.replace('District', 'Mining District')
    return label + ' Mining District'

labels = [ fix_label(x) for x in mining_districts['ftr_name'] ]
mining_districts['label'] = labels

For the aliases, we only need to set the ones that don't have 'Mining District' in the name originally. This is done using

aliases = [ x if 'mining district' not in x.lower() else '' for x in mining_districts['ftr_name'] ]
mining_districts['aliases'] = aliases

Another relevant piece of information about the mining districts is where they are located. Unfortunately, the dataset does not include the county or state in which each district lies, but by pulling in the 2020 Census Counties layer we can intersect the district polygons with the counties and add the resulting state and county information.
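
As a hedged sketch of what that pull might look like (the item ID is a placeholder for the 2020 Census Counties item on ArcGIS Online; the field names match the matching code further down):

from arcgis.gis import GIS

gis = GIS()  # anonymous connection to ArcGIS Online
census_item = gis.content.get('<census-counties-item-id>')  # placeholder item ID
census_cts = census_item.layers[0].query(
    out_fields='STATE_NAME,NAME',  # state and county names used in the matching step
    as_df=True                     # return a spatially enabled DataFrame with a SHAPE column
)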

After passing the item id to the arcgis Python library and querying the layer for the results, we can see that the geometries are provided as either rings or multipolygons (by accessing the as_shapely property on the rings). To make sure that the shapes can be intersected properly, we first make sure that all shapes are valid (given that census_cts is the census counties DataFrame):

census_cts['SHAPE'] = [shapely.make_valid(x.as_shapely) if not shapely.is_valid(x.as_shapely) else x.as_shapely for x in census_cts['SHAPE']]
mining_districts['geometry'] = [shapely.make_valid(x) if not shapely.is_valid(x) else x for x in mining_districts['geometry']]

Then, for each geometry in mining_districts, we find the first county shape it intersects, return a tuple containing the state and county, and create new columns in the DataFrame:

def match_shapes(poly: Polygon, rows: pd.DataFrame) -> tuple:
    if not shapely.is_valid(poly):
        poly = shapely.make_valid(poly)
    for i in range(len(rows)):
        row = rows.iloc[i]
        multipoly = row['SHAPE']
        if poly.intersects(multipoly):
            return row['STATE_NAME'], row['NAME']
    return '', ''

county_state = []
for i in range(len(mining_districts)):
    district = mining_districts.iloc[i]
    ct_st_match = match_shapes(district['geometry'], census_cts)
    # if either state or county is '', then match not found
    if '' in ct_st_match:
        print('Shape Match not found for provided Polygon\n',f'District not matched: {district["ftr_name"]}')
    county_state.append(ct_st_match)

mining_districts['state'] = [x[0] for x in county_state]
mining_districts['county'] = [x[1] for x in county_state]

WikibaseIntegrator expects QIDs whenever the property datatype is set to wikibase-item. It was not clear whether the counties had already been entered into the GeoKB, so two approaches were taken to see if any additional work had to be done before the county QIDs could be found:

  • A name was taken from the county column (it didn't matter which one) and pasted into the search bar of the GUI. Usually this will surface the relevant item since the search uses ElasticSearch, but if too many results were returned, it was possible to narrow them down using the Search In dropdown menu by setting it to only find Items.
  • To verify, we can open the U.S. county Item page and use What links here on the right-hand side. From this it can be seen that all the counties have been added, but their titles are more descriptive than just the county name so that there is no confusion between counties sharing the same name.

Now that we know the counties all exist, we need to find the corresponding QID for each title of interest. To do this, a general item search function was created:

def item_search(label: str, instance_of: str, bot_name: str) -> str:
  sparql_endpoint = os.environ[f'WB_SPARQL_{bot_name}']
  query = f'''PREFIX wdt: <https://geokb.wikibase.cloud/prop/direct/>
  PREFIX wd:  <https://geokb.wikibase.cloud/entity/>
  SELECT ?item
  WHERE {{
    ?item rdfs:label ?label ;
       wdt:P1 wd:{instance_of} .
    FILTER CONTAINS( LCASE(?label), "{label.lower()}") .

    SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
  }}
  '''

  params = {
      'query': query,
      'format': 'json'
  }

  res = requests.get(sparql_endpoint, params=params, timeout=100)
  json_res = res.json()
  item_result = (json_res['results']['bindings'][0]['item']['value']
                  if 'results' in json_res
                  and len(json_res['results']['bindings']) > 0
                  and 'item' in json_res['results']['bindings'][0]
                  else None)
  return item_result.split('/')[-1] if item_result is not None else None

What this does is make a SPARQL request to find the item QID, with the restriction that the item must be an instance of the given QID. Once found, it parses the returned URI so that everything except the QID is stripped, or returns None if nothing is found. The bot name parameter being passed in is the same name that is expected whenever the WikibaseConnection class is instantiated. For example, to find the county QIDs (given that the QID for U.S. county is Q481), we add a column using a list comprehension:

county_instance_of = 'Q481'
county_items = [
    item_search(f'{county}, {state}', county_instance_of, name) 
    for (county,state) 
    in zip(mining_districts['county'], mining_districts['state'])
]
mining_districts['county_qid'] = county_items

Notice this zips up the county and state element-wise so that the full county name can be matched against what is found within the GeoKB. One entry in the DataFrame (the one with the feature name St. Louis County) cannot be matched to a QID as-is because of the abbreviation. To fix that we run a regex replacement on the column, escaping the . so it matches literally and passing the pattern in as a raw string:

mining_districts['county'] = mining_districts['county'].replace(r'St\.', 'Saint', regex=True)

Re-running the previous list comprehension will now return QID matches for every row.

A similar approach was taken to get the QIDs of the states, by finding the U.S. State QID to use as the instance of and passing it into the item_search function:

state_instance_of = 'Q229'
state_items = [ item_search(state, state_instance_of, name) for state in mining_districts['state'] ]
mining_districts['state_qid'] = state_items

One last thing that needs to be fixed is the dates in the last_updt column of mining_districts. There is one entry that is not in the correct format (i.e. YYYY-M-DD instead of YYYY-MM-DD), and although it could be fixed manually, it is better to handle this generally in case any other datapoints end up this way. A fix_time function was created and the replacement is made as follows:

import datetime
import re

def fix_time(t: str, t_format: str) -> str:
    # Re-parse the date and emit it in zero-padded YYYY-MM-DD format
    d = datetime.datetime.strptime(t, t_format)
    return d.strftime('%Y-%m-%d')

new_time = [fix_time(x, '%Y-%m-%d') if not re.search(r'\d\d\d\d-\d\d-\d\d', x) else x for x in mining_districts['last_updt']]
mining_districts['last_updt'] = new_time

Inserting Into the GeoKB

Using the WikibaseIntegrator docs, the items were created with the correct parameters for each label, description, and claim.

Each property needed to associate the QIDs was found within the bot, under the prop_lookup dictionary that is automatically stored in the class.

Each action was declared beforehand so that it would be easy to adjust within the item creation as needed. The docs for all the available options are found here.
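
As a rough sketch of what a single item creation might look like under these assumptions (bot being the instantiated WikibaseConnection with a wbi client attached, row being one record from mining_districts, and the prop_lookup keys and target class QID being placeholders rather than the exact values used):

from wikibaseintegrator import datatypes
from wikibaseintegrator.wbi_enums import ActionIfExists

# Assumed names: bot.wbi (WikibaseIntegrator client), bot.prop_lookup (name -> PID map),
# row (one record from mining_districts); the 'Q...' class QID is a placeholder.
item = bot.wbi.item.new()
item.labels.set('en', row['label'])
item.descriptions.set('en', 'mining district from the USMIN Mineral Deposit Database')
if row['aliases']:
    item.aliases.set('en', row['aliases'])

claims = [
    datatypes.Item(prop_nr=bot.prop_lookup['instance of'], value='Q...'),
    datatypes.Item(prop_nr=bot.prop_lookup['located in the administrative territorial entity'],
                   value=row['county_qid']),
    datatypes.GlobeCoordinate(prop_nr=bot.prop_lookup['coordinate location'],
                              latitude=row['centroid_lat'],
                              longitude=row['centroid_lon'],
                              precision=0.0001),
]
item.claims.add(claims, action_if_exists=ActionIfExists.REPLACE_ALL)
item.write(summary='Adding USMIN mining district')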