Item talk:Q158190
USMIN Mining Districts found in the polygonal features
Purpose
This effort started with the need to correlate mining districts and relate them back to any USMIN mine data found in the Prospect- and Mine-Related Features from U.S. Geological Survey 7.5- and 15-Minute Topographic Quadrangle Maps of the United States (ver. 9.0, January 2023). To do this, various sources were considered, but the best choice turned out to be the records provided by the USMIN Mineral Deposit Database, since it contains a polygons layer with the mining districts and their representative polygonal shapes.
Retrieval of Data
Using the WFS version 1.1.0 URL endpoint found on the website, along with the Capabilities document (which includes the metadata and possible parameters for each layer), the polygons layer was pulled into a Jupyter Notebook so that we could turn it into a DataFrame and preprocess it into a format that will be easy for the wbmaker to read. We use the Fiona Python library to pull in the data. One thing to note is that reading a WFS is not enabled by default in Fiona; the driver needs to be registered, which can be done with the following:
fiona.drvsupport.supported_drivers['WFS'] = 'r'
From here we can use Fiona's BytesCollection to retrieve the features. Here is the full function to read the WFS:
import requests
import geopandas as gpd
from fiona import BytesCollection

def read_wfs(url, params):
    # fetch the WFS response and read its features into a GeoDataFrame
    with BytesCollection(requests.get(url, params=params).content) as f:
        df = gpd.GeoDataFrame.from_features(f, crs='EPSG:4326')
    return df
The parameters were provided by both using the Capabilities page provided by the dataset and by referring to the OGC documentation. Using those we can set the parameters as:
params = {
    'request': 'GetFeature',
    'service': 'WFS',
    'version': '1.1.0',
    'typeName': 'polygons',
    'srsName': 'urn:ogc:def:crs:EPSG::4326'
}
This returns the desired polygons layer within the WFS.
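As a quick sanity check on what requests actually sends for that dict, the final query string can be previewed with the standard library (the base URL here is a placeholder, not the real USMIN endpoint):

```python
from urllib.parse import urlencode

params = {
    'request': 'GetFeature',
    'service': 'WFS',
    'version': '1.1.0',
    'typeName': 'polygons',
    'srsName': 'urn:ogc:def:crs:EPSG::4326'
}

# requests.get(url, params=params) appends exactly this query string
query = urlencode(params)
full_url = 'https://example.usgs.gov/wfs?' + query  # placeholder base URL
```

Note that urlencode percent-encodes the colons in the srsName URN, which is what the server expects.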
Processing and cleaning
Once the data was pulled into the notebook, a few things needed to be done before the records could be loaded into the GeoKB. First, the Polygons found in the geometry column were incorrect in that the coordinates were stored as Y,X rather than the desired X,Y order. Looking into resources to fix this didn't help much, as there wasn't a direct way to reverse the coordinates that worked properly (e.g. Shapely's reverse() function on the Polygon object, which reverses a ring's winding order rather than swapping axes, returned an effectively identical output), so after some thought this helper function was created:
from shapely.geometry import Polygon

def reverse_coords(geom: Polygon) -> Polygon:
    # swap each (y, x) coordinate pair to (x, y)
    coords = geom.exterior.coords.xy
    reversed_coords = tuple(zip(coords[1], coords[0]))
    return Polygon(reversed_coords)
Computationally this is not efficient, as it creates a whole new Polygon object to replace the old one, but since there are fewer than 1,000 entries in the dataset, this counts as "good enough". From here we call this function in a list comprehension using
[reverse_coords(x) for x in mining_districts['geometry']]
and assign the result back to the geometry column to fix the issue.
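The underlying swap is simple enough to illustrate on bare coordinate tuples (the coordinates below are made up); reverse_coords just does this and wraps the result back into a Polygon:

```python
# coordinates stored as (y, x), i.e. (lat, lon)
yx_coords = [(38.5, -117.1), (38.6, -117.0), (38.4, -116.9)]

# swap each pair into the desired (x, y), i.e. (lon, lat), order
xy_coords = [(x, y) for (y, x) in yx_coords]
```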
Another issue found was that there are duplicate entries within the dataset. Using pandas, we can fix this quite easily (given that mining_districts is the DataFrame we're working with):
mining_districts = mining_districts.drop_duplicates(subset=['site_id'])
Another claim of interest is the centroid coordinate of the polygon geometry. After testing various methods found in the Shapely library, round-tripping through WKT and taking the centroid produced the most accurate result. Therefore, we get the centroid using the following:
from shapely import wkt
mining_districts['centroid'] = [wkt.loads(str(x)).centroid for x in mining_districts['geometry']]
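For intuition, the centroid Shapely reports for a simple polygon is the area-weighted centroid from the shoelace formula. A minimal pure-Python version (an illustration only, not the code used in the notebook):

```python
def polygon_centroid(coords):
    # area-weighted centroid of a simple polygon via the shoelace formula
    a = cx = cy = 0.0
    n = len(coords)
    for i in range(n):
        x0, y0 = coords[i]
        x1, y1 = coords[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    # divide the accumulated moments by 6 times the signed area
    return cx / (6.0 * a), cy / (6.0 * a)

polygon_centroid([(0, 0), (1, 0), (1, 1), (0, 1)])  # unit square -> (0.5, 0.5)
```

This differs from simply averaging the vertices, which skews toward regions with densely spaced points.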
From there, we set the desired label and aliases where possible. The goal is to make sure that every label contains 'Mining District' to differentiate these items from any similarly named items within the GeoKB. We use the following to do that:
def fix_label(label: str) -> str:
    if 'Mining District' in label:
        return label
    # turn e.g. 'Eureka District' into 'Eureka Mining District'
    if 'District' in label:
        return label.replace('District', 'Mining District')
    return label + ' Mining District'
labels = [ fix_label(x) for x in mining_districts['ftr_name'] ]
mining_districts['label'] = labels
For the aliases, we only need to set the ones that don't have 'Mining District' in the name originally. This is done using:
aliases = [ x if 'mining district' not in x.lower() else '' for x in mining_districts['ftr_name'] ]
mining_districts['aliases'] = aliases
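Putting the label and alias steps together on a couple of made-up district names (assuming the fix_label behavior described above):

```python
def fix_label(label: str) -> str:
    # append 'Mining District' only when the name does not already carry it
    if 'Mining District' in label:
        return label
    if 'District' in label:
        return label.replace('District', 'Mining District')
    return label + ' Mining District'

names = ['Comstock', 'Tonopah Mining District']
labels = [fix_label(x) for x in names]
aliases = [x if 'mining district' not in x.lower() else '' for x in names]
# labels  -> ['Comstock Mining District', 'Tonopah Mining District']
# aliases -> ['Comstock', '']
```

The empty-string aliases are simply filtered out later, since a name that already reads "... Mining District" needs no alias.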
Some relevant information for the mining districts is where they are located. Unfortunately, the dataset does not include the county or state in which each district lies, but by pulling in the 2020 Census Counties layer we can find a match between the polygons and add the resulting state and county information.
After passing the layer's ID to the arcgis Python library and querying the layer for results, we can see that the geometries are provided as either rings or multipolygons (by calling as_shapely on the rings). To make sure the shapes can be intersected properly, we first ensure that all shapes are valid (given that census_cts is the census counties DataFrame):
census_cts['SHAPE'] = [shapely.make_valid(x.as_shapely) if not shapely.is_valid(x.as_shapely) else x.as_shapely for x in census_cts['SHAPE']]
mining_districts['geometry'] = [shapely.make_valid(x) if not shapely.is_valid(x) else x for x in mining_districts['geometry']]
Then, for each geometry shape in mining_districts, we find an intersection between it and the county shapes, return a tuple containing the state and county, and create new columns in the DataFrame:
import shapely
import pandas as pd
from shapely.geometry import Polygon

def match_shapes(poly: Polygon, rows: pd.DataFrame) -> tuple:
    if not shapely.is_valid(poly):
        poly = shapely.make_valid(poly)
    for i in range(len(rows)):
        row = rows.iloc[i]
        multipoly = row['SHAPE']
        if poly.intersects(multipoly):
            return row['STATE_NAME'], row['NAME']
    return '', ''
county_state = []
for i in range(len(mining_districts)):
    district = mining_districts.iloc[i]
    ct_st_match = match_shapes(district['geometry'], census_cts)
    # if either state or county is '', then a match was not found
    if '' in ct_st_match:
        print('Shape match not found for provided Polygon\n', f'District not matched: {district["ftr_name"]}')
    county_state.append(ct_st_match)
mining_districts['state'] = [x[0] for x in county_state]
mining_districts['county'] = [x[1] for x in county_state]
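The matching loop boils down to "the first county whose shape touches the district wins". A toy version of the same logic using axis-aligned bounding boxes instead of Shapely geometries (the county names and boxes below are made up for illustration):

```python
def boxes_intersect(a, b):
    # a and b are (minx, miny, maxx, maxy) bounding boxes
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

# toy county "shapes" as bounding boxes
counties = [
    ('Nevada', 'Esmeralda', (-118.0, 37.0, -117.0, 38.0)),
    ('Nevada', 'Nye', (-117.0, 36.0, -115.0, 39.0)),
]

def match_bbox(district_box, counties):
    # return the first county the district overlaps, mirroring match_shapes
    for state, county, box in counties:
        if boxes_intersect(district_box, box):
            return state, county
    return '', ''

match_bbox((-117.5, 37.4, -117.3, 37.6), counties)  # -> ('Nevada', 'Esmeralda')
```

One consequence of first-match-wins, in both this sketch and match_shapes, is that a district straddling two counties is attributed only to the county that happens to come first in the table.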
One last thing that needs fixing is the dates found in the last_updt column of mining_districts. One entry is not in the correct format (i.e. YYYY-M-DD instead of YYYY-MM-DD), and although this could be fixed manually, it is better to plan ahead in case any other datapoints end up this way. A fix_time function was created and the replacement is made as follows:
import datetime
import re

def fix_time(t: str, t_format: str) -> str:
    # converts to YYYY-MM-DD format
    d = datetime.datetime.strptime(t, t_format)
    return d.strftime('%Y-%m-%d')

new_time = [fix_time(x, '%Y-%m-%d') if not re.search(r'\d\d\d\d-\d\d-\d\d', x) else x for x in mining_districts['last_updt']]
mining_districts['last_updt'] = new_time
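One subtlety worth noting: strptime with '%Y-%m-%d' is lenient about zero-padding, so the same format string that the regex uses to detect well-formed dates also parses the malformed YYYY-M-DD entry. A quick check with a made-up date value:

```python
import datetime

def fix_time(t: str, t_format: str) -> str:
    # converts to YYYY-MM-DD format
    d = datetime.datetime.strptime(t, t_format)
    return d.strftime('%Y-%m-%d')

fix_time('2016-1-05', '%Y-%m-%d')  # strptime accepts the single-digit month
# -> '2016-01-05'
```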