Reputation: 2521
If df['col'] = 'a','b','c' and df2['col'] = 'a123','b456','d789', how do I create df2['is_contained'] = 'a','b','no_match', where a value from df['col'] is returned if it is found within a value from df2['col'], and 'no_match' is returned if no match is found? I don't expect there to be multiple matches, but in the unlikely case there are, I'd want to return a string like 'Multiple Matches'.
Upvotes: 13
Views: 44264
Reputation: 39463
You must first guarantee that the indexes match. To simplify, I'll show it as if the columns were in the same dataframe. The trick is to use the apply method along the columns axis (axis=1):
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
'col2': ['a123','b456','d789', 'a']})
df['contained'] = df.apply(lambda x: x.col1 in x.col2, axis=1)
df
col1 col2 contained
0 a a123 True
1 b b456 True
2 c d789 False
3 d a False
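The boolean column above can be turned into the value-or-'no_match' column the question asks for. The following is a minimal sketch under the same assumption (both columns in one dataframe with aligned indexes); the column name is_contained is taken from the question, and the check is a plain substring test:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                   'col2': ['a123', 'b456', 'd789', 'a']})

# Row by row: keep col1 when it is a substring of col2, else 'no_match'.
df['is_contained'] = df.apply(
    lambda row: row['col1'] if row['col1'] in row['col2'] else 'no_match',
    axis=1,
)
print(df)
```

Note that this only compares values sitting in the same row; it does not search every col1 value against every col2 value.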
Upvotes: 3
Reputation: 375905
In pandas 0.13, you can use str.extract:
In [11]: df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
In [12]: df2 = pd.DataFrame({'col': ['d23','b456','a789']})
In [13]: df2.col.str.extract('(%s)' % '|'.join(df1.col))
Out[13]:
0 NaN
1 b
2 a
Name: col, dtype: object
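To also cover the 'no_match' and 'Multiple Matches' cases from the question, one option is str.findall with the same alternation pattern. This is a sketch rather than a definitive solution: the helper label and the extra row 'ab12' are made up for illustration, and re.escape is added on the assumption that the lookup values may contain regex metacharacters.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c']})
# 'ab12' is an extra, hypothetical row that matches two lookup values.
df2 = pd.DataFrame({'col': ['d23', 'b456', 'a789', 'ab12']})

# Build one alternation pattern, escaping each lookup value.
pattern = '|'.join(re.escape(v) for v in df1['col'])

# findall returns a list of every non-overlapping match per row.
found = df2['col'].str.findall(pattern)

def label(matches):
    """Map a list of matches to the value, 'no_match', or 'Multiple Matches'."""
    distinct = set(matches)
    if not distinct:
        return 'no_match'
    if len(distinct) > 1:
        return 'Multiple Matches'
    return distinct.pop()

df2['is_contained'] = found.apply(label)
print(df2)
```

Because regex alternation consumes text left to right, lookup values that overlap (for example 'a' and 'ab') may not all be reported, so treat the multiple-match detection as approximate in that case.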
Upvotes: 1
Reputation: 7018
With this toy data set, we want to add a new column to df2 which will contain no_match for the first three rows, and the last row will contain the value 'd' because that row's col value (the letter 'a') appears in df1.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col': ['a123','b456','d789', 'a']})
In other words, values from df1 should be used to populate this new column in df2 only when a row's df2['col'] value appears somewhere in df1['col'].
In [2]: df1
Out[2]:
col
0 a
1 b
2 c
3 d
In [3]: df2
Out[3]:
col
0 a123
1 b456
2 d789
3 a
If this is the right way to understand your question, then you can do this with pandas isin:
In [4]: df2.col.isin(df1.col)
Out[4]:
0 False
1 False
2 False
3 True
Name: col, dtype: bool
This evaluates to True only when a value in df2.col is also in df1.col.
Then you can use np.where, which is more or less the same as ifelse in R, if you are familiar with R at all.
In [5]: np.where(df2.col.isin(df1.col), df1.col, 'NO_MATCH')
Out[5]: array(['NO_MATCH', 'NO_MATCH', 'NO_MATCH', 'd'], dtype=object)
For rows where a df2.col value appears in df1.col, the value from df1.col will be returned for the given row index. In cases where the df2.col value is not a member of df1.col, the default 'NO_MATCH' value will be used.
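One thing to watch: np.where picks the df1.col value at the same row position, which is why the last row yields 'd' even though the matched value is 'a'. If the matched value itself is wanted, a small variation is to take the value from df2.col instead. This is only a sketch; the column name is_contained is borrowed from the question:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col': ['a123', 'b456', 'd789', 'a']})

# True where the whole df2 value equals some df1 value (exact match, not substring).
mask = df2['col'].isin(df1['col'])

# Keep the df2 value where it matched; otherwise fall back to 'NO_MATCH'.
df2['is_contained'] = np.where(mask, df2['col'], 'NO_MATCH')
print(df2)
```

Since isin tests whole-value equality, 'a123' still does not count as containing 'a' here; a substring search needs a string method such as str.contains or str.extract.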
Upvotes: 8