Reputation: 3485
I want to find the values of col1
and col2
where the col1
and col2
of the first dataframe are both in the second dataframe.
These rows should be in the result dataframe:
pizza, boy
pizza, girl
ice cream, boy
because all three rows are in the first and second dataframes.
How do I possibly accomplish this? I was thinking of using isin
, but I am not sure how to use it when I have to consider more than one column.
Upvotes: 25
Views: 59614
Reputation: 1048
One possible way is to define a check function of your own and perform apply
on the dataframe.
For example, if you know the list of combinations that need to be filtered (this list can be extracted beforehand from a dataframe):
filter_list_multicols = [["book", "cat"], ["table", "dog"], ["table", "cat"], ["pen", "horse"], ["book", "horse"]]
Then you could define a check function as so:
def isin_multicols_check(stationary_name, animal_name):
for filter_pair in filter_list_multicols:
if (stationary_name == filter_pair[0]) and (animal_name == filter_pair[1]):
return True
return False
Example dataframe:
df = pd.DataFrame([
[1, "book", "dog"], [2, "pen", "dog"], [3, "pen", "rat"], [4, "book", "horse"], [5, "book", "cat"]
],
columns=["S.N.", "stationary_name", "animal_name"])
df
S.N. stationary_name animal_name
1 book dog
2 pen dog
3 pen rat
4 book horse
5 book cat
And now, call the function using pandas apply
:
df["is_in"] = df.apply(lambda x: isin_multicols_check(x.stationary_name, x.animal_name), axis=1)
df
S.N. stationary_name animal_name is_in
1 book dog false
2 pen dog false
3 pen rat false
4 book horse true
5 book cat true
The result:
is_in = df[df["is_in"]==True]
not_is_in = df[df["is_in"]==False]
Upvotes: 1
Reputation: 894
If somehow you must stick to isin
or the negate version ~isin
.
You may first create a new column, with the concatenation of col1
, col2
. Then use isin
to filter your data. Here is the code:
import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))
df1['indicator'] = df1['col1'].str.cat(df1['col2'])
df2['indicator'] = df2['col1'].str.cat(df2['col2'])
df2.loc[df2['indicator'].isin(df1['indicator'])].drop(columns=['indicator'])
which gives
col1 col2
10 pizza boy
11 pizza girl
16 ice cream boy
If you do so remember to make sure that concatenating two columns doesn't create false positives e.g. concatenation of 123
and 456
in df1
and concatenation of 12
and 3456
in df2
will match even though their respective columns don't match. You can fix this problem by additional sep
parameter.
df1['indicator'] = df1['col1'].str.cat(df1['col2'], sep='$$$')
df2['indicator'] = df2['col1'].str.cat(df2['col2'], sep='$$$')
Upvotes: 4
Reputation: 1257
Thank you unutbu! Here is a little update.
import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))
df1[df1.set_index(['col1','col2']).index.isin(df2.set_index(['col1','col2']).index)]
return:
col1 col2
1 pizza boy
4 pizza girl
5 ice cream boy
Upvotes: 16
Reputation: 879351
Perform an inner merge on col1
and col2
:
import pandas as pd
df1 = pd.DataFrame({'col1': ['pizza', 'hamburger', 'hamburger', 'pizza', 'ice cream'], 'col2': ['boy', 'boy', 'girl', 'girl', 'boy']}, index=range(1,6))
df2 = pd.DataFrame({'col1': ['pizza', 'pizza', 'chicken', 'cake', 'cake', 'chicken', 'ice cream'], 'col2': ['boy', 'girl', 'girl', 'boy', 'girl', 'boy', 'boy']}, index=range(10,17))
print(pd.merge(df2.reset_index(), df1, how='inner').set_index('index'))
yields
col1 col2
index
10 pizza boy
11 pizza girl
16 ice cream boy
The purpose of the reset_index
and set_index
calls are to preserve df2
's index as in the desired result you posted. If the index is not important, then
pd.merge(df2, df1, how='inner')
# col1 col2
# 0 pizza boy
# 1 pizza girl
# 2 ice cream boy
would suffice.
Alternatively, you could construct MultiIndex
s out of the col1
and col2
columns, and then call the MultiIndex.isin
method:
index1 = pd.MultiIndex.from_arrays([df1[col] for col in ['col1', 'col2']])
index2 = pd.MultiIndex.from_arrays([df2[col] for col in ['col1', 'col2']])
print(df2.loc[index2.isin(index1)])
yields
col1 col2
10 pizza boy
11 pizza girl
16 ice cream boy
Upvotes: 28