Reputation: 176
Suppose I have this df_atm
:
borough Longitude Latitude
0 bronx 40.79 -73.78
1 manhattan 40.78 -73.90
2 staten island 40.84 -73.95
3 NaN 40.57 -74.11
Every row represents an ATM withdrawal.
I hope to generate value for missing value based on the coordinate inside the Longitude and Latitude columns.
borough Longitude Latitude
0 bronx 40.79 -73.78
1 manhattan 40.78 -73.90
2 staten island 40.84 -73.95
3 staten island 40.57 -74.11
Since coordinate [40.57, -74.11] are inside Staten Island's borough.
I have generated a dict with boroughs' coordinates:
borough_dict = {"Bronx" : [40.837048, -73.865433], "Brooklyn" : [40.650002, -73.949997], "Manhattan" : [40.758896, -73.985130], "Queens" : [40.742054,-73.769417], "Staten Island" : [40.579021,-74.151535]}
And this is what I try so far (code/pseudocode):
df_atm['borough'] = df_atm.apply(
lambda row: **idk what do to here** if np.isnan(row['borough']) else row['borough'],
axis=1
)
Many thanks in advance!
Upvotes: 2
Views: 439
Reputation: 59519
You want a spatial
join, so use the very closely related GeoPandas
library. We'll convert your original DataFrame to a GeoDataFrame so that we can merge. Also note in your example your Latitude
and Longitude
columns are incorrectly labeled. I fixed that here.
import pandas as pd
import geopandas as gpd
dfg = gpd.GeoDataFrame(df.copy(), geometry=gpd.points_from_xy(df.Longitude, df.Latitude))
# borough Latitude Longitude geometry
#0 bronx 40.79 -73.78 POINT (-73.78000 40.79000)
#1 manhattan 40.78 -73.90 POINT (-73.90000 40.78000)
#2 staten island 40.84 -73.95 POINT (-73.95000 40.84000)
#3 NaN 40.57 -74.11 POINT (-74.11000 40.57000)
# Shapefile from https://geo.nyu.edu/catalog/nyu-2451-34154
# I downloaded the geojson
df_nys = gpd.read_file('nyu-2451-34154-geojson.json')
dfg.crs = df_nys.crs # Set coordinate reference system to be the same
dfg = gpd.sjoin(dfg, df_nys[['geometry', 'boroname']], how='left', op='within')
borough Latitude Longitude geometry index_right boroname
0 bronx 40.79 -73.78 POINT (-73.78000 40.79000) 4.0 Queens
1 manhattan 40.78 -73.90 POINT (-73.90000 40.78000) 4.0 Queens
2 staten island 40.84 -73.95 POINT (-73.95000 40.84000) NaN NaN
3 NaN 40.57 -74.11 POINT (-74.11000 40.57000) 2.0 Staten Island
So now you could fill missing 'borough'
s with 'boroname'
. But it does seem like a few of the other points are miss-classified. This is mostly because you don't have enough precision on your stored Latitude and Longitude. Though this would probably be the more accurate solution with better precision on Lat/Lon, I might favor the distance calculation by @adnanmuttaleb given the level of precision you have in your data.
Upvotes: 1
Reputation: 3604
Try this :
from math import cos, asin, sqrt
import pandas as pd
def distance(lat1, lon1, lat2, lon2):
p = 0.017453292519943295
a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p)*cos(lat2*p) * (1-cos((lon2-lon1)*p)) / 2
return 12742 * asin(sqrt(a))
def closest(data, v):
return min(data, key=lambda p: distance(v[0], v[1], p[0], p[1]))
df = pd.DataFrame(
[
{'borough': 'bronx', 'lat': 40.79, 'long': -73.78},
{'borough': 'manhattan', 'lat': 40.78, 'long': -73.90},
{'borough': None, 'lat': 40.57, 'long': -74.11}
],
)
borough_dict = {"Bronx" : [40.837048, -73.865433], "Brooklyn" : [40.650002, -73.949997], "Manhattan" : [40.758896, -73.985130], "Queens" : [40.742054,-73.769417], "Staten Island" : [40.579021,-74.151535]}
boroughs = [(*value, key) for key, value in borough_dict.items()]
df['borough'] = df.apply(
lambda row: closest(boroughs, [row['lat'], row['long']])[2] if row['borough'] is None else row['borough'],
axis=1
)
print(df)
Output:
borough lat long
0 bronx 40.79 -73.78
1 manhattan 40.78 -73.90
2 Staten Island 40.57 -74.11
Credit to @trincot answer
Upvotes: 3