Reputation: 53
I've got a list of over 1mm lat/long pairs that I'm looking to reverse geocode using the reverse-geocoder library. I've got my dataframe set up with columns for lat, long, and a pair which I constructed as (lat, long).
My code:
import numpy as np
import scipy
import pandas as pdimport reverse_geocoder as rg
df = pd.read_csv ('reverse.csv')
This provides a dataframe:
Unnamed: 0 email lat long
0 1 a@email.com 42.36 -71.07
1 2 b@email.com 37.72 -122.44
2 3 c@email.com 40.83 -74.25
3 4 d@email.com 41.70 -70.22
4 5 f@email.com 34.27 -118.71
Realizing I need a few more columns, I add the following:
df['reversePair'] = list(zip(df.lat, df.long))
Which creates a datacolumn and has values such as (42.36, -71.07)
I then added additional columns:
df=df.assign(city="")
df=df.assign(state="")
df=df.assign(country="")
All's well up to this point. I then try to run the function row by row (there has to be a better way, especially since the package can take multiple tuples but I've struggled at this point already):
(Note for the below, limiting to two records for example purposes)
import itertools
n = 2
for item in itertools.islice(df['reversePair'], n):
print(item)
x = rg.search(item)
for i in x:
print(i["name"])
print(i["admin1"])
print(i["cc"])
df["city"] = i["name"]
df["state"] = i["admin1"]
df["country"] = i["cc"]
df.head(2)
This then gives me the following correct results, but in the dataset it is only adding (overwriting?) the last value returned into both records. It's driving me crazy.
(42.36, -71.07)
Boston
Massachusetts
US
(37.72, -122.44)
Daly City
California
US
Unnamed: 0 email lat long city reversePair state country
0 1.0 a@email.com 42.36 -71.07 Daly City (42.36, -71.07) California US
1 2.0 b@email.com 37.72 -122.44 Daly City (37.72, -122.44) California US
Row 0 should be city = Boston, state = Massachusetts, country = US (it is, but assuming it's from the wrong record).
Any ideas on where I'm going wrong with this would be much appreciated. Also, if anyone can see a way to make it faster, that would also be very much appreciated! Thank you!
Upvotes: 0
Views: 176
Reputation: 29635
Your problem is at each loop you assign the value to the full column as you don't select which row. To fix, it would be something like df.loc[idx, "city"] = i["name"]
, where idx
would be the index value of the row of the item
(except you don't extract it currently).
Anyway, I don't have the library installed, but of what I read on the link provided, I believe you can do without a loop for
, directly pass all the coordinates couples in rg.search
.
# create a dataframe from the result of rg.search
res = pd.DataFrame(
# you can pass a tuple of several two coordinates tuple at once
rg.search(tuple(df['reversePair'].tolist())),
# use the same index than df for later join or column assignation
index=df.index)
print(res.head()) # to have an idea of what it is
Then you can either
# use join if column's names don't matter too much
new_df = df.join(res[['name','admin1','cc']])
or
# just add the column with the name you wanted
df['city'] = res['name']
df['state'] = res['admin1']
df["country"] = res["cc"]
Upvotes: 0