Anna Efron
Anna Efron

Reputation: 133

Python dataframe: Dropping rows subject to multiple conditions

Suppose I have the following (example) dataframe:

   a  b  c  d  e
0  9  9  0  9  9
1  1  2  1  9  9
2  8  8  0  2  3
3  7  7  0  7  8
4  1  2  0  3  4
5  6  2  3  6  6
6  1  2  0  1  2
7  1  3  0  1  2

Also suppose I have generated an (arbitrary) list of indices, e.g. [3,4]. For each element in the list, I want to drop all rows from the data frame that have the same values in column 'a' and column 'b' as rows 3 and 4.

Since row 3 has a=7 and b=7, and no other rows have a=7 and b=7, only row 3 is dropped.

Since row 4 has a=1 and b=2, and rows 1 and 6 also have a=1 and b=2, I drop rows 4, 1, and 6.

So the resulting dataframe would look like this:

   a  b  c  d  e
0  9  9  0  9  9
1  8  8  0  2  3
2  6  2  3  6  6
3  1  3  0  1  2

Does anyone know how to come up with a solution to do this quickly (for use in a much larger data frame)? Thank you.

Upvotes: 1

Views: 174

Answers (1)

akuiper
akuiper

Reputation: 214927

Make use of numpy broadcasting;

  • Extract values at indices and columns with loc and reshape it to 3d array:

    df.loc[indices,cols].values[:,None]

  • compare it with columns a and b, which will compare rows 3 and 4 with all other rows because of dimension mismatch and numpy broadcasting

    df[cols].values == df.loc[indices,cols].values[:,None]

  • use .all(2) to make sure both columns match, and any(0) to get matches for either row 3 or row 4

  • Negate ~ and drop matched rows

gives:

indices = [3,4]
cols = ['a','b']
df[~(df[cols].values == df.loc[indices,cols].values[:,None]).all(2).any(0)]

#   a  b  c  d  e
#0  9  9  0  9  9
#2  8  8  0  2  3
#5  6  2  3  6  6
#7  1  3  0  1  2

Upvotes: 1

Related Questions