Red Sparrow

Reputation: 397

Conditionally filter rows in one df for specific columns which are common to subset of another df

Let's assume a df1

df1 = pd.DataFrame(
    {'col1': {0: 500.0, 1: 500.0, 2: 833.3, 3: 500.0, 4: 833.3, 5: 500.0, 6: 833.3},
     'col2': {0: 1833.3, 1: 1000.0, 2: 1833.3, 3: 2666.7, 4: 1833.3, 5: 3500.0, 6: 1000.0},
     'col3': {0: 250.0, 1: 250.0, 2: 30.0, 3: 30.0, 4: 30.0, 5: 103.3, 6: 176.7},
     'col4': {0: 3.4, 1: 4.0, 2: 2.2, 3: 3.4, 4: 2.2, 5: 4.0, 6: 3.4},
     'col5': {0: 0.25, 1: 0.15, 2: 0.1, 3: 0.25, 4: 0.25, 5: 0.1, 6: 0.1},
     'col6': {0: 364, 1: 937, 2: 579, 3: 313, 4: 600, 5: 49, 6: 13}})

And a df2

df2 = pd.DataFrame(
    {'col1': {0: 833.3, 1: 500.0, 2: 500.0, 3: 500.0, 4: 500.0, 5: 500.0, 6: 500.0, 7: 833.3, 8: 500.0, 9: 833.3, 10: 500.0, 11: 500.0, 12: 833.3, 13: 833.3, 14: 833.3},
     'col2': {0: 1833.3, 1: 1000.0, 2: 1833.3, 3: 3500.0, 4: 3500.0, 5: 1000.0, 6: 2666.7, 7: 1833.3, 8: 2666.7, 9: 1000.0, 10: 2666.7, 11: 2666.7, 12: 1000.0, 13: 1833.3, 14: 1833.3},
     'col3': {0: 30.0, 1: 250.0, 2: 250.0, 3: 103.3, 4: 176.7, 5: 103.3, 6: 30.0, 7: 103.3, 8: 30.0, 9: 176.7, 10: 250.0, 11: 103.3, 12: 30.0, 13: 30.0, 14: 250.0},
     'col4': {0: 2.2, 1: 4.0, 2: 3.4, 3: 4.0, 4: 2.2, 5: 2.8, 6: 2.8, 7: 2.8, 8: 3.4, 9: 3.4, 10: 2.8, 11: 2.8, 12: 3.4, 13: 2.2, 14: 2.8},
     'col5': {0: 0.25, 1: 0.15, 2: 0.25, 3: 0.1, 4: 0.2, 5: 0.15, 6: 0.15, 7: 0.25, 8: 0.25, 9: 0.1, 10: 0.15, 11: 0.1, 12: 0.15, 13: 0.1, 14: 0.2}})

What is the most pythonic way to drop the rows in df2 where col1, col2, col3, and col4 (and coln, for that matter) have the same values as the respective columns of df1? I don't want to merge the dataframes; I only want to drop any rows in df2 (there might be multiple) where the row tuple over the columns of interest also appears in df1.

I have only come up with how to do this using:

new_df = df2.loc[df2['col1'].isin(df1['col1']) &
                 df2['col2'].isin(df1['col2']) &
                 df2['col3'].isin(df1['col3']) &
                 df2['col4'].isin(df1['col4']) &
                 df2['col5'].isin(df1['col5'])]

which is a bit cumbersome for larger datasets and for more columns.
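As an aside, a toy sketch (hypothetical data) suggests this per-column isin chaining may not even match whole rows, since each column is checked for membership independently:

```python
import pandas as pd

# Hypothetical toy frames
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20]})
df2 = pd.DataFrame({'col1': [1], 'col2': [20]})

# col1=1 occurs somewhere in df1['col1'] and col2=20 somewhere in
# df1['col2'], so the row passes even though (1, 20) is not a row of df1
mask = df2['col1'].isin(df1['col1']) & df2['col2'].isin(df1['col2'])
```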

Any ideas for a better way?

Upvotes: 1

Views: 136

Answers (2)

jpp

Reputation: 164673

You can use set_index combined with pd.Index.difference to extract your result:

cols = list(df2)  # match on the columns the two frames share (df1 also has col6)

idx = df2.set_index(cols).index\
         .difference(df1.set_index(cols).index)

res = df2.set_index(cols).loc[idx].reset_index()

The benefit of this method is that it doesn't require converting numeric values to strings.
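As a minimal self-contained sketch on toy data (hypothetical frames, matching only on the columns both frames share):

```python
import pandas as pd

# Toy stand-ins for the question's frames; df1 has an extra column
# (like col6) that df2 lacks, so we match only on the shared columns
df1 = pd.DataFrame({'col1': [1, 2], 'col2': [10, 20], 'extra': [0, 0]})
df2 = pd.DataFrame({'col1': [1, 3, 2], 'col2': [10, 30, 99]})

cols = [c for c in df2.columns if c in df1.columns]

# Rows of df2, viewed as tuples over cols, that do not appear in df1
idx = df2.set_index(cols).index.difference(df1.set_index(cols).index)
res = df2.set_index(cols).loc[idx].reset_index()
```

One thing to be aware of: Index.difference returns a sorted index, so the original row order of df2 is not preserved in res.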

Upvotes: 1

BENY

Reputation: 323236

You can use isin; before doing that, we need to create a key from all the values of col1 through coln (convert each to str and concatenate them together):

df2[~df2.astype(str).sum(1).isin(df1[df2.columns].astype(str).sum(1))]
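One caveat worth noting: plain concatenation can collide, since '1' + '23' and '12' + '3' both produce '123'. A sketch of the same idea on hypothetical toy data, joining with a separator to keep the keys unambiguous (the `row_key` helper and the `'|'` separator are illustrative choices, not from the answer):

```python
import pandas as pd

# Hypothetical toy frames chosen so plain concatenation would collide:
# '1' + '23' and '12' + '3' both give '123'
df1 = pd.DataFrame({'col1': [1, 12], 'col2': [23, 3]})
df2 = pd.DataFrame({'col1': [12, 5], 'col2': [3, 6]})

def row_key(df):
    # Join each row's values with a separator so keys stay unambiguous
    return df.astype(str).apply('|'.join, axis=1)

res = df2[~row_key(df2).isin(row_key(df1[df2.columns]))]
```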

Upvotes: 2
