grega
grega

Reputation: 53

Problem running a python function against a pandas dataframe and returning values into the dataframe

I've got a list of over 1mm lat/long pairs that I'm looking to reverse geocode using the reverse-geocoder library. I've got my dataframe set up with columns for lat, long, and a pair which I constructed as (lat, long).

My code:

import numpy as np
import scipy
import pandas as pdimport reverse_geocoder as rg

df = pd.read_csv ('reverse.csv')

This provides a dataframe:

      Unnamed:  0      email        lat     long
   0      1            a@email.com  42.36   -71.07  
   1      2            b@email.com  37.72   -122.44 
   2      3            c@email.com  40.83   -74.25
   3      4            d@email.com  41.70   -70.22
   4      5            f@email.com  34.27   -118.71

Realizing I need a few more columns, I add the following:

df['reversePair'] = list(zip(df.lat, df.long))

Which creates a datacolumn and has values such as (42.36, -71.07)

I then added additional columns:

df=df.assign(city="")
df=df.assign(state="")
df=df.assign(country="")

All's well up to this point. I then try to run the function row by row (there has to be a better way, especially since the package can take multiple tuples but I've struggled at this point already):

(Note for the below, limiting to two records for example purposes)

import itertools

n = 2

for item in itertools.islice(df['reversePair'], n):
    print(item) 
    x = rg.search(item)
    for i in x:
        print(i["name"])
        print(i["admin1"])
        print(i["cc"])
    
        df["city"] = i["name"]
        df["state"] = i["admin1"]
        df["country"] = i["cc"]
    
df.head(2)

This then gives me the following correct results, but in the dataset it is only adding (overwriting?) the last value returned into both records. It's driving me crazy.

    (42.36, -71.07)
    Boston
    Massachusetts
    US
    (37.72, -122.44)
    Daly City
    California
    US
      Unnamed: 0    email       lat     long    city    reversePair state     country
   0    1.0     a@email.com 42.36   -71.07  Daly City   (42.36, -71.07)   California    US
   1    2.0     b@email.com 37.72   -122.44 Daly City   (37.72, -122.44)  California    US

Row 0 should be city = Boston, state = Massachusetts, country = US (it is, but assuming it's from the wrong record).

Any ideas on where I'm going wrong with this would be much appreciated. Also, if anyone can see a way to make it faster, that would also be very much appreciated! Thank you!

Upvotes: 0

Views: 176

Answers (1)

Ben.T
Ben.T

Reputation: 29635

Your problem is at each loop you assign the value to the full column as you don't select which row. To fix, it would be something like df.loc[idx, "city"] = i["name"], where idx would be the index value of the row of the item (except you don't extract it currently).

Anyway, I don't have the library installed, but of what I read on the link provided, I believe you can do without a loop for, directly pass all the coordinates couples in rg.search.

# create a dataframe from the result of rg.search
res = pd.DataFrame(
    # you can pass a tuple of several two coordinates tuple at once
    rg.search(tuple(df['reversePair'].tolist())), 
    # use the same index than df for later join or column assignation
    index=df.index)
print(res.head()) # to have an idea of what it is    

Then you can either

# use join if column's names don't matter too much
new_df = df.join(res[['name','admin1','cc']])

or

# just add the column with the name you wanted
df['city'] = res['name']
df['state'] = res['admin1']
df["country"] = res["cc"]

Upvotes: 0

Related Questions