Reputation: 2930

Pandas Slow. Want first occurrence in DataFrame

I have a DataFrame of people. One of the columns in this DataFrame is a place_id. I also have a DataFrame of places, where one of the columns is place_id and another is weather. For every person, I am trying to find the corresponding weather. Importantly, many people have the same place_ids.

Currently, my setup is this:

def place_id_to_weather(pid):
    return place_df[place_df['place_id'] == pid]['weather'].item() 

person_df['weather'] = person_df['place_id'].map(place_id_to_weather)`

But this is untenably slow. I would like to speed this up. I suspect that I could achieve a speedup like this:

Instead of returning place_df[...].item(), which does a search for place_id == pid for that entire column and returns a series, and then grabbing the first item in that series, I really just want to curtail the search in place_df after the first match place_df['place_id']==pid has been found. After that, I don't need to search any further. How do I limit the search to first occurrences only?

Are there other methods I could use to achieve a speedup here? Some kind of join-type method?

Upvotes: 4

Answers (3)

A.Kot

Reputation: 7903

The map function is your quickest method, the purpose of which is to avoid calling an entire dataframe to run some function repeatedly. This is what you ended up doing in your function i.e. calling an entire dataframe which is fine but not good doing it repeatedly. To tweak your code just a little will significantly speed up your process and only call the place_df dataframe once:

person_df['weather'] = person_df['place_id'].map(dict(zip(place_df.place_id, place_df.weather)))

Upvotes: 1

jezrael

Reputation: 862661

I think you need drop_duplicates with merge, if there is only common columns place_id and weather in both DataFrames, you can omit parameter on (it depends of data, maybe on='place_id' is necessary):

df1 = place_df.drop_duplicates(['place_id'])
print (df1)

print (pd.merge(person_df, df1))

Sample data:

person_df = pd.DataFrame({'place_id':['s','d','f','s','d','f'],
                          'A':[4,5,6,7,8,9]})
print (person_df)
   A place_id
0  4        s
1  5        d
2  6        f
3  7        s
4  8        d
5  9        f

place_df = pd.DataFrame({'place_id':['s','d','f', 's','d','f'],
                         'weather':['y','e','r', 'h','u','i']})
print (place_df)
  place_id weather
0        s       y
1        d       e
2        f       r
3        s       h
4        d       u
5        f       i

def place_id_to_weather(pid):
    #for first occurence add iloc[0]
    return place_df[place_df['place_id'] == pid]['weather'].iloc[0]

person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
print (person_df)
   A place_id weather
0  4        s       y
1  5        d       e
2  6        f       r
3  7        s       y
4  8        d       e
5  9        f       r

#keep='first' is by default, so can be omit
print (place_df.drop_duplicates(['place_id']))
  place_id weather
0        s       y
1        d       e
2        f       r

print (pd.merge(person_df, place_df.drop_duplicates(['place_id'])))
   A place_id weather
0  4        s       y
1  7        s       y
2  5        d       e
3  8        d       e
4  6        f       r
5  9        f       r

Upvotes: 2

3kt

Reputation: 2553

You can use merge to do the operation :

people = pd.DataFrame([['bob', 1], ['alice', 2], ['john', 3], ['paul', 2]], columns=['name', 'place'])

#    name  place
#0    bob      1
#1  alice      2
#2   john      3
#3   paul      2

weather = pd.DataFrame([[1, 'sun'], [2, 'rain'], [3, 'snow'], [1, 'rain']], columns=['place', 'weather'])

#   place weather
#0      1     sun
#1      2    rain
#2      3    snow
#3      1    rain

pd.merge(people, weather, on='place')

#    name  place weather
#0    bob      1     sun
#1    bob      1    rain
#2  alice      2    rain
#3   paul      2    rain
#4   john      3    snow

In case you have several weather for the same place, you may want to use drop_duplicates, then you have the following result :

pd.merge(people, weather, on='place').drop_duplicates(subset=['name', 'place'])

#    name  place weather
#0    bob      1     sun
#2  alice      2    rain
#3   paul      2    rain
#4   john      3    snow

Upvotes: 0

Pandas Slow. Want first occurrence in DataFrame

Answers (3)

Related Questions