Pandas combine dataframes, drop rows that where value does not appear in all initial dataframes

Question

I have two Pandas dataframes, df1 and df2. I would like to combine these into a single dataframe (df) but drop any rows where the value that appears in the 'A' column of df1 but is not present in the 'A' column of df2.

Input:

[in]    df1 =          A     B
                 0     i     y
                 1     ii    y

[in]    df2 =          A     B
                 0     ii    x
                 1     i     y
                 2     iii   z
                 3     iii   z

Desired output:

[out]    df =          A     B
                 0     i     y
                 1     ii    y
                 2     ii    x
                 3     i     y

In the example above, all rows were added to df except those in df2 with 'iii' in the 'A' column, because 'iii' does not appear anywhere in column 'A' of df1.

To take this a step further, the initial number of dataframes is not limited to two. There could be three or more, and I would want to drop any column 'A' values that do not appear in ALL of the dataframes.

How can I make this happen?

Thanks in advance!

Leo · Accepted Answer

This will work for any generic list of dataframes. Also, order of dataframes does not matter.

df1 = pd.DataFrame([['i', 'y'], ['ii', 'y']], columns=['A', 'B'])
df2 = pd.DataFrame([['ii', 'x'], ['i', 'y'], ['iii', 'z'], ['iii', 'z']], columns=['A', 'B'])

dfs = [df1, df2]

set_A = set.intersection(*[set(dfi.A.tolist()) for dfi in dfs])
df = pd.concat([dfi[dfi.A.isin(set_A)] for dfi in dfs])

Pandas combine dataframes, drop rows that where value does not appear in all initial dataframes

Answers (1)

Related Questions