Reputation: 2930
I have a DataFrame of people
. One of the columns in this DataFrame is a place_id
. I also have a DataFrame of places, where one of the columns is place_id
and another is weather
. For every person, I am trying to find the corresponding weather. Importantly, many people have the same place_id
s.
Currently, my setup is this:
def place_id_to_weather(pid):
return place_df[place_df['place_id'] == pid]['weather'].item()
person_df['weather'] = person_df['place_id'].map(place_id_to_weather)`
But this is untenably slow. I would like to speed this up. I suspect that I could achieve a speedup like this:
Instead of returning place_df[...].item()
, which does a search for place_id == pid
for that entire column and returns a series, and then grabbing the first item in that series, I really just want to curtail the search in place_df
after the first match place_df['place_id']==pid
has been found. After that, I don't need to search any further. How do I limit the search to first occurrences only?
Are there other methods I could use to achieve a speedup here? Some kind of join-type method?
Upvotes: 4
Views: 284
Reputation: 7903
The map function is your quickest method, the purpose of which is to avoid calling an entire dataframe to run some function repeatedly. This is what you ended up doing in your function i.e. calling an entire dataframe which is fine but not good doing it repeatedly. To tweak your code just a little will significantly speed up your process and only call the place_df dataframe once:
person_df['weather'] = person_df['place_id'].map(dict(zip(place_df.place_id, place_df.weather)))
Upvotes: 1
Reputation: 862661
I think you need drop_duplicates
with merge
, if there is only common columns place_id
and weather
in both DataFrames
, you can omit parameter on
(it depends of data, maybe on='place_id'
is necessary):
df1 = place_df.drop_duplicates(['place_id'])
print (df1)
print (pd.merge(person_df, df1))
Sample data:
person_df = pd.DataFrame({'place_id':['s','d','f','s','d','f'],
'A':[4,5,6,7,8,9]})
print (person_df)
A place_id
0 4 s
1 5 d
2 6 f
3 7 s
4 8 d
5 9 f
place_df = pd.DataFrame({'place_id':['s','d','f', 's','d','f'],
'weather':['y','e','r', 'h','u','i']})
print (place_df)
place_id weather
0 s y
1 d e
2 f r
3 s h
4 d u
5 f i
def place_id_to_weather(pid):
#for first occurence add iloc[0]
return place_df[place_df['place_id'] == pid]['weather'].iloc[0]
person_df['weather'] = person_df['place_id'].map(place_id_to_weather)
print (person_df)
A place_id weather
0 4 s y
1 5 d e
2 6 f r
3 7 s y
4 8 d e
5 9 f r
#keep='first' is by default, so can be omit
print (place_df.drop_duplicates(['place_id']))
place_id weather
0 s y
1 d e
2 f r
print (pd.merge(person_df, place_df.drop_duplicates(['place_id'])))
A place_id weather
0 4 s y
1 7 s y
2 5 d e
3 8 d e
4 6 f r
5 9 f r
Upvotes: 2
Reputation: 2553
You can use merge
to do the operation :
people = pd.DataFrame([['bob', 1], ['alice', 2], ['john', 3], ['paul', 2]], columns=['name', 'place'])
# name place
#0 bob 1
#1 alice 2
#2 john 3
#3 paul 2
weather = pd.DataFrame([[1, 'sun'], [2, 'rain'], [3, 'snow'], [1, 'rain']], columns=['place', 'weather'])
# place weather
#0 1 sun
#1 2 rain
#2 3 snow
#3 1 rain
pd.merge(people, weather, on='place')
# name place weather
#0 bob 1 sun
#1 bob 1 rain
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow
In case you have several weather for the same place, you may want to use drop_duplicates
, then you have the following result :
pd.merge(people, weather, on='place').drop_duplicates(subset=['name', 'place'])
# name place weather
#0 bob 1 sun
#2 alice 2 rain
#3 paul 2 rain
#4 john 3 snow
Upvotes: 0