
USMIN Mining Districts found in the polygonal features

Purpose

This effort started with the need to correlate mining districts and relate them back to any USMIN mine data found in the Prospect- and Mine-Related Features from U.S. Geological Survey 7.5- and 15-Minute Topographic Quadrangle Maps of the United States (ver. 9.0, January 2023). Several sources were evaluated, and the records provided by the USMIN Mineral Deposit Database were determined to be the best choice, since that database contains a polygons layer with the mining districts and their representative polygonal shapes.

Retrieval of Data

Using the WFS Version 1.1.0 URL endpoint found on the website, along with the Capabilities document that lists the metadata and possible parameters for each layer, the polygons layer was pulled into a Jupyter Notebook so that we can turn it into a DataFrame and preprocess it into a format that is easy for the wbmaker to read. We use the Fiona Python library to pull in the data. One thing to note is that reading a WFS is not enabled by default in Fiona and needs to be registered, which can be done with the following:

import fiona

fiona.drvsupport.supported_drivers['WFS'] = 'r'

From here we can use Fiona’s BytesCollection to retrieve the features. Here is the full function to read the WFS:

import geopandas as gpd
import requests
from fiona import BytesCollection

def read_wfs(url, params):
    # Request the WFS response and parse the returned features into a GeoDataFrame
    with BytesCollection(requests.get(url, params=params).content) as f:
        df = gpd.GeoDataFrame.from_features(f, crs='EPSG:4326')
    return df

The parameters come from the Capabilities page provided by the dataset and from the OGC documentation. Using those, we can set the parameters as:

params = {
    'request':'GetFeature',
    'service':'WFS',
    'version':'1.1.0',
    'typeName':'polygons',
    'srsName': 'urn:ogc:def:crs:EPSG::4326'
}

This returns the desired polygons layer within the WFS.
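
As a minimal usage sketch, the two pieces above can be combined like so; the endpoint URL here is a placeholder, since the actual WFS URL comes from the USMIN website:

# Hypothetical endpoint URL; substitute the WFS URL published on the USMIN site
wfs_url = 'https://example.usgs.gov/geoserver/wfs'
mining_districts = read_wfs(wfs_url, params)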

Processing and cleaning

Once the data was pulled into the notebook, a few things needed to be done before we could load the records into the GeoKB. First, the Polygons found in the geometry column were incorrect: the coordinates were in Y,X rather than the desired X,Y order. Looking into resources to fix this didn’t help much, as there wasn’t a direct way to swap the coordinates that worked properly (e.g. shapely’s reverse() method on the Polygon object returned an identical output, since it reverses the vertex order rather than swapping the axes within each coordinate pair), so after some thought this helper function was created:

from shapely.geometry import Polygon

def reverse_coords(geom: Polygon) -> Polygon:
    # Swap the X and Y axes of each exterior-ring vertex (interior rings are not handled)
    coords = geom.exterior.coords.xy
    reverse = tuple(zip(coords[1], coords[0]))
    return Polygon(reverse)

Computationally this is not efficient, as it creates a whole new Polygon object to replace the old one, but since there are < 1000 entries in the dataset this counts as “good enough”. From here we call the function in a list comprehension and assign the result back to the geometry column to fix the issue:

mining_districts['geometry'] = [reverse_coords(x) for x in mining_districts['geometry']]
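
As a quick sanity check (purely illustrative, with made-up coordinates), a polygon stored as Y,X comes back in X,Y order:

# A polygon whose vertices were stored as (lat, lon); reverse_coords returns (lon, lat)
swapped = Polygon([(44.0, -110.5), (44.1, -110.5), (44.1, -110.4)])
fixed = reverse_coords(swapped)
print(list(fixed.exterior.coords)[0])  # (-110.5, 44.0)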

Another issue found was that there are duplicate entries within the dataset. Using pandas, we can fix this quite easily (given that mining_districts is the DataFrame we’re working with):

mining_districts = mining_districts.drop_duplicates(subset=['site_id'])

Another claim of interest is the centroid coordinate of the polygon geometry. After testing various methods in the Shapely library, round-tripping each geometry through WKT before calling centroid produced the most accurate result. Therefore, we get the centroid using the following:

from shapely import wkt
mining_districts['centroid'] = [wkt.loads(str(x)).centroid for x in mining_districts['geometry']]

From there, we set the desired label and aliases where possible. The goal is to make sure that every label contains ‘Mining District’ to differentiate these items from other, similarly named items within the GeoKB. We use the following to do that:

def fix_label(label: str) -> str:
    # Already in the desired form
    if 'Mining District' in label:
        return label
    # e.g. 'Foo District' becomes 'Foo Mining District'
    if 'District' in label:
        return label.replace('District', 'Mining District')
    return label + ' Mining District'

labels = [ fix_label(x) for x in mining_districts['ftr_name'] ]
mining_districts['label'] = labels

For the aliases, we only need to set the ones that don’t have ‘Mining District’ in the name originally. This is done using:

aliases = [ x if 'mining district' not in x.lower() else '' for x in mining_districts['ftr_name'] ]
mining_districts['aliases'] = aliases
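
To illustrate with a few made-up feature names, the label and alias logic behaves as follows:

# Hypothetical examples of the label/alias behavior
print(fix_label('Bodie'))                  # 'Bodie Mining District'
print(fix_label('Bodie District'))         # 'Bodie Mining District'
print(fix_label('Bodie Mining District'))  # unchanged
# 'Bodie' is kept as an alias; names already containing 'Mining District' get ''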

Some relevant information for the mining districts is where they are located. Unfortunately, the dataset did not include the county or state in which each district lies, but by pulling in the 2020 Census Counties layer we can find a match between the polygons and add the resulting state and county information.
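
As a hedged sketch of the retrieval step (the item id below is a placeholder, not the real one), the layer can be pulled in with the arcgis Python API and converted to a spatially enabled DataFrame:

from arcgis.gis import GIS

gis = GIS()  # anonymous connection to ArcGIS Online
item = gis.content.get('<census-counties-item-id>')  # hypothetical item id
census_cts = item.layers[0].query(out_fields='STATE_NAME,NAME').sdf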

With the results queried, we can see that the geometries are provided as either rings or multipolygons (by calling the as_shapely property on them). To make sure that the shapes can be intersected properly, we first make sure that all shapes are valid (given that census_cts is the census counties DataFrame):

import shapely

# Repair any invalid geometries so the intersection tests behave correctly
census_cts['SHAPE'] = [shapely.make_valid(x.as_shapely) if not shapely.is_valid(x.as_shapely) else x.as_shapely for x in census_cts['SHAPE']]
mining_districts['geometry'] = [shapely.make_valid(x) if not shapely.is_valid(x) else x for x in mining_districts['geometry']]

Then, for each geometry in mining_districts, we find the first county shape it intersects, return a tuple containing the state and county, and create new columns in the DataFrame:

import pandas as pd

def match_shapes(poly: Polygon, rows: pd.DataFrame) -> tuple:
    # Repair the district polygon if needed before testing intersections
    if not shapely.is_valid(poly):
        poly = shapely.make_valid(poly)
    # Return the state and county of the first county shape the polygon intersects
    for i in range(len(rows)):
        row = rows.iloc[i]
        multipoly = row['SHAPE']
        if poly.intersects(multipoly):
            return row['STATE_NAME'], row['NAME']
    return '', ''

county_state = []
for i in range(len(mining_districts)):
    district = mining_districts.iloc[i]
    ct_st_match = match_shapes(district['geometry'], census_cts)
    # if either state or county is '', then no match was found
    if '' in ct_st_match:
        print('Shape match not found for provided Polygon\n', f'District not matched: {district["ftr_name"]}')
    county_state.append(ct_st_match)

mining_districts['state'] = [x[0] for x in county_state]
mining_districts['county'] = [x[1] for x in county_state]

One last thing that needs to be fixed is the dates found in the last_updt column of mining_districts. There is one entry that is not in the correct format (i.e. YYYY-M-DD instead of YYYY-MM-DD); although this could be fixed manually, it is better to handle it generally in case any other datapoints end up this way. A fix_time function was created and the replacement is made as follows:

import datetime
import re

def fix_time(t: str, t_format: str) -> str:
    # Parse with the given format and re-serialize as zero-padded YYYY-MM-DD
    d = datetime.datetime.strptime(t, t_format)
    return d.strftime('%Y-%m-%d')

# Normalize any entry that is not already in zero-padded YYYY-MM-DD form
new_time = [fix_time(x, '%Y-%m-%d') if not re.search(r'\d\d\d\d-\d\d-\d\d', x) else x for x in mining_districts['last_updt']]
mining_districts['last_updt'] = new_time
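
As a quick illustration with a made-up value, the helper zero-pads the month back into the standard format:

# Hypothetical malformed entry with a single-digit month
print(fix_time('2014-7-09', '%Y-%m-%d'))  # '2014-07-09'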