Reputation: 5508
I have two DataFrame, one is 'recipe', the combination of the ingredients, the other is 'like', which contains the popular combinations.
recipe = pd.DataFrame({'A': ['chicken','beef','pork','egg', 'chicken', 'egg', 'beef'],
'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})
recipe
A B
0 chicken sweet
1 beef hot
2 pork salty
3 egg hot
4 chicken sweet
5 egg salty
6 beef hot
like = pd.DataFrame({'A':['beef', 'egg'], 'B':['hot', 'salty']})
like
A B
0 beef hot
1 egg salty
How can I add a column 'C' to recipe, if the combination listed in 'like', then I give it value 'yes', otherwise 'no'?
The result I want is
recipe
A B C
0 chicken sweet no
1 beef hot yes
2 pork salty no
3 egg hot no
4 chicken sweet no
5 egg salty yes
6 beef hot yes
The problem is my both dataframes are large. I can not manually choose the items in 'like' and assign the 'yes' label in 'recipe'. Are there any easy ways to do that?
Upvotes: 1
Views: 3405
Reputation: 15953
You can use set_value
by matching both A
and B
as such:
recipe.set_value(recipe[recipe.A.isin(like.A) & recipe.B.isin(like.B)].index,'C','yes')
recipe.fillna('no')
Which will give you:
A B C
0 chicken sweet no
1 beef hot yes
2 pork salty no
3 egg hot yes
4 chicken sweet no
5 egg salty yes
6 beef hot yes
Note: These results do not mean my answer is better than other ones or vice versa.
Using set_value
:
%timeit recipe.set_value(recipe[recipe.A.isin(like.A) & recipe.B.isin(like.B)].index,'C','yes'); recipe.fillna('no')
100 loops, best of 3: 2.69 ms per loop
Using merge
and creating new df
:
%timeit df = pd.merge(recipe, like, on=['A','B'], indicator=True, how='left'); df['C'] = np.where(df['_merge'] == 'both', 'yes', 'no')
100 loops, best of 3: 8.42 ms per loop
Using merge
only:
%timeit df['C'] = np.where(df['_merge'] == 'both', 'yes', 'no')
1000 loops, best of 3: 187 µs per loop
Again, it really depends on what you're timing. Just be cautious of duplicating your data.
Upvotes: 1
Reputation: 862671
You can use merge
and numpy.where
:
df = pd.merge(recipe, like, on=['A','B'], indicator=True, how='left')
print df
A B _merge
0 chicken sweet left_only
1 beef hot both
2 pork salty left_only
3 egg hot left_only
4 chicken sweet left_only
5 egg salty both
6 beef hot both
df['C'] = np.where(df['_merge'] == 'both', 'yes', 'no')
print df[['A','B','C']]
A B C
0 chicken sweet no
1 beef hot yes
2 pork salty no
3 egg hot no
4 chicken sweet no
5 egg salty yes
6 beef hot yes
Faster is use df['_merge'] == 'both'
:
In [460]: %timeit np.where(np.in1d(df['_merge'],'both'), 'yes', 'no')
100 loops, best of 3: 2.22 ms per loop
In [461]: %timeit np.where(df['_merge'] == 'both', 'yes', 'no')
1000 loops, best of 3: 652 µs per loop
Upvotes: 2
Reputation: 879591
You could add a C
column of 'yes'
s to like
and then merge recipe
with like
.
The rows that match will have yes
in the C
column, the rows without a match will have NaN
s. You could then use fillna
to replace the NaNs with 'no'
s:
import pandas as pd
recipe = pd.DataFrame({'A': ['chicken','beef','pork','egg', 'chicken', 'egg', 'beef'],
'B': ['sweet', 'hot', 'salty', 'hot', 'sweet', 'salty', 'hot']})
like = pd.DataFrame({'A':['beef', 'egg'], 'B':['hot', 'salty']})
like['C'] = 'yes'
result = pd.merge(recipe, like, how='left').fillna('no')
print(result)
yields
A B C
0 chicken sweet no
1 beef hot yes
2 pork salty no
3 egg hot no
4 chicken sweet no
5 egg salty yes
6 beef hot yes
Upvotes: 1