irene
irene

Reputation: 2243

How to find common elements in several dataframes

I have the following dataframes:

df1 = pd.DataFrame({'col1': ['A','M','C'],
                   'col2': ['B','N','O'],
                   # plus many more
                   })

df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
                    'col4': ['M','P','Q','J','P','M'],
                   # plus many more
                   })

Which look like these:

df1:

col1   col2
A      B
M      N
C      O
#...plus many more

df2:

col3   col4
A      M
A      P
A      Q
B      J
B      P
B      M
#...plus many more

The objective is to create a dataframe containing all elements in col4 for each col3 that occurs in one row in df1. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then, we go to df2, and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get, for A: ['M','P','Q'], and for B, ['J','P','M']. The intersection of these is['M', 'P'], so what I want is something like this

col1   col2    col4
A      B       M
A      B       P
....(and so on for the other rows)

The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.

Upvotes: 1

Views: 49

Answers (1)

iacob
iacob

Reputation: 24221

This should achieve what you want, using a combination of merge, groupby and set intersection:

# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()

# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()

# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)

# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)

print(df3)

  col1 col2    col4
0    A    B  {P, M}

If required, see this answer for examples of how to 'melt' col4.

Upvotes: 1

Related Questions