Reputation: 41
I'm a newbie to pandas, and suspect this has a simple answer, but it's stumping me right now. I want to select the rows from multiple dataframes (with same columns) where a value in one column occurs in all of the dataframes.
So if I have the following:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'Col1': ['Adams','Baker','Cash','Jones','Thomas'],
                    'Col2': ['A','B','C','D','E'], 'Col3': [10,11,12,13,14]})
df2 = pd.DataFrame({'Col1': ['Adams','Cash','Fox','Jones','Johnson'],
                    'Col2': ['D','E','F','G','H'], 'Col3': [40,50,60,70,80]})
df3 = pd.DataFrame({'Col1': ['Adams','Barns','Jones','Smith','Thomas'],
                    'Col2': ['F','G','J','L','M'], 'Col3': [21,21,22,23,24]})
print(df1)
print(df2)
print(df3)
Giving:
Col1 Col2 Col3
0 Adams A 10
1 Baker B 11
2 Cash C 12
3 Jones D 13
4 Thomas E 14
Col1 Col2 Col3
0 Adams D 40
1 Cash E 50
2 Fox F 60
3 Jones G 70
4 Johnson H 80
Col1 Col2 Col3
0 Adams F 21
1 Barns G 21
2 Jones J 22
3 Smith L 23
4 Thomas M 24
I want to end up with:
Adams A 10
Adams D 40
Adams F 21
Jones D 13
Jones G 70
Jones J 22
Is there a succinct way to do this?
Upvotes: 4
Views: 7946
Reputation: 109546
First, create a container to hold your dataframes (dfs). You can use reduce on the set intersection of names in Col1 to find the names that appear in all of the dataframes. Lastly, you can use query together with concat to join the relevant rows.
from functools import reduce  # reduce is a builtin in Python 2; in Python 3 it must be imported

dfs = (df1, df2, df3)
common_names = \
    reduce(lambda s, df: s.intersection(set(df.Col1)),  # The set intersection of names.
           dfs[1:],            # Sequence of dataframes to pass through reduce, starting with the second.
           set(dfs[0].Col1))   # Seed reduce with the column of names from the first dataframe.
>>> common_names
{'Adams', 'Jones'}
>>> pd.concat((df.query('Col1 in @common_names') for df in dfs),
...           ignore_index=True).sort_values('Col1')
Col1 Col2 Col3
0 Adams A 10
2 Adams D 40
4 Adams F 21
1 Jones D 13
3 Jones G 70
5 Jones J 22
This solution scales nicely, as you only need to change the contents of dfs should the number of dataframes change.
P.S. After re-reading the question, I realized you're a Pandas newbie. Welcome! Although Pandas is incredibly powerful, there is no simple solution to your question. Basically, you want to intersect the name columns to get the set of common names, then filter each dataframe for those names, and concatenate the results.
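If you'd like that pipeline as a single reusable helper, here is a minimal sketch (the function name rows_with_common_values and its arguments are mine, not part of this answer):
def rows_with_common_values(frames, col='Col1'):
    # Names that appear in every dataframe.
    common = set.intersection(*(set(f[col]) for f in frames))
    # Keep only the rows whose name is in that common set, then stack the results.
    return pd.concat([f[f[col].isin(common)] for f in frames], ignore_index=True)

result = rows_with_common_values([df1, df2, df3]).sort_values('Col1')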
Upvotes: 0
Reputation: 76297
You can first find the common values:
common = set.intersection(set(df1.Col1), set(df2.Col1), set(df3.Col1))
Then concatenate the rows whose values are within the set of common values:
pd.concat([
    df1[df1.Col1.isin(common)],
    df2[df2.Col1.isin(common)],
    df3[df3.Col1.isin(common)]]).sort_values(by='Col1')
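If you also want a clean 0..n-1 index on the result, one small extra step (my addition, not part of the original answer) is to reset the index after sorting:
result = (pd.concat([df1[df1.Col1.isin(common)],
                     df2[df2.Col1.isin(common)],
                     df3[df3.Col1.isin(common)]])
            .sort_values(by='Col1')
            .reset_index(drop=True))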
Upvotes: 5
Reputation: 769
One solution would be to first get the values of Col1 that exist in all of your DataFrames, then restrict each DataFrame to those common values, and finally concatenate the restricted DataFrames:
common_values = list(set(df1.Col1) & set(df2.Col1) & set(df3.Col1))
df = pd.concat([df1[df1.Col1.isin(common_values)],
                df2[df2.Col1.isin(common_values)],
                df3[df3.Col1.isin(common_values)]], ignore_index=True)
This will give you the following result:
print(df)
Col1 Col2 Col3
Adams A 10
Jones D 13
Adams D 40
Jones G 70
Adams F 21
Jones J 22
I also noticed that your desired DataFrame is sorted; if that is the case, you can sort it with the following command:
df = df.sort_values(['Col1', 'Col2', 'Col3'])
This will give you:
print(df)
Col1 Col2 Col3
Adams A 10
Adams D 40
Adams F 21
Jones D 13
Jones G 70
Jones J 22
Upvotes: 0