Reputation: 133
Suppose I have the following (example) dataframe:
a b c d e
0 9 9 0 9 9
1 1 2 1 9 9
2 8 8 0 2 3
3 7 7 0 7 8
4 1 2 0 3 4
5 6 2 3 6 6
6 1 2 0 1 2
7 1 3 0 1 2
Also suppose I have generated an (arbitrary) list of indices, e.g. [3,4]. For each element in the list, I want to drop all rows from the data frame that have the same values in column 'a' and column 'b' as rows 3 and 4.
Since row 3 has a=7 and b=7, and no other rows have a=7 and b=7, only row 3 is dropped.
Since row 4 has a=1 and b=2, and rows 1 and 6 also have a=1 and b=2, I drop rows 4, 1, and 6.
So the resulting dataframe would look like this:
a b c d e
0 9 9 0 9 9
1 8 8 0 2 3
2 6 2 3 6 6
3 1 3 0 1 2
Does anyone know how to come up with a solution to do this quickly (for use in a much larger data frame)? Thank you.
Upvotes: 1
Views: 174
Reputation: 214927
Make use of numpy
broadcasting;
Extract values at indices and columns with loc
and reshape it to 3d array:
df.loc[indices,cols].values[:,None]
compare it with columns a
and b
, which will compare rows 3 and 4 with all other rows because of dimension mismatch and numpy broadcasting
df[cols].values == df.loc[indices,cols].values[:,None]
use .all(2)
to make sure both columns match, and any(0)
to get matches for either row 3 or row 4
Negate ~
and drop matched rows
gives:
indices = [3,4]
cols = ['a','b']
df[~(df[cols].values == df.loc[indices,cols].values[:,None]).all(2).any(0)]
# a b c d e
#0 9 9 0 9 9
#2 8 8 0 2 3
#5 6 2 3 6 6
#7 1 3 0 1 2
Upvotes: 1