Reputation: 1204
How do I delete (drop) rows from a Pandas DataFrame, in which the rows are not uniquely identified by the index?
To clarify what I mean by uniquely identified: When querying into the index, multiple values are returned.
I can make a mask which, when used as an argument to df.loc
, correctly returns the targeted rows, but I can't make it work with df.drop
.
The output of the print functions is included as a comment.
from __future__ import print_function
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(
data=np.random.randint(0, 10, size=(7, 3)),
index=[['u01', 'u01', 'u01', 'u02', 'u02', 'u03', 'u03'], ['C', 'C', 'C', 'C', 'T', 'T', 'T']],
columns=['foo', 'bar', 'baz'])
df.index.names = ['user', 'comType']
print(df)
# foo bar baz
# user comType
# u01 C 5 0 3
# C 3 7 9
# C 3 5 2
# u02 C 4 7 6
# T 8 8 1
# u03 T 6 7 7
# T 8 1 5
mask = (df.baz > 8) | (df.baz < 2)
print(mask)
# user comType
# u01 C False
# C True
# C False
# u02 C False
# T True
# u03 T False
# T False
# Name: baz, dtype: bool
print(df.loc[mask])
# foo bar baz
# user comType
# u01 C 3 7 9
# u02 T 8 8 1
df2 = df.drop(mask.index[mask.values]) # Drops all rows of user user u01, one row of user u02
print(df2)
# foo bar baz
# user comType
# u02 C 4 7 6
# u03 T 6 7 7
# T 8 1 5
df3 = df.drop(mask) # Doesn't do anything
print(df3)
# foo bar baz
# user comType
# u01 C 5 0 3
# C 3 7 9
# C 3 5 2
# u02 C 4 7 6
# T 8 8 1
# u03 T 6 7 7
# T 8 1 5
A quick tip: If you want to run the code on your own machine, I suggest replace the print-functions with these lines, as the output otherwise is hard to discern form one another:
print('df', df, sep=" =\n\n", end="\n"*5)
print('mask', mask, sep=" =\n\n", end="\n"*5)
print('df.loc[mask]', df.loc[mask], sep=" =\n\n", end="\n"*5)
print('df2', df2, sep=" =\n\n", end="\n"*5)
print('df3', df3, sep=" =\n\n", end="\n"*5)
Upvotes: 1
Views: 145
Reputation: 863226
Use inverting boolean mask by ~
with boolean indexing
:
mask = (df.baz > 8) | (df.baz < 2)
print (mask)
user comType
u01 C False
C True
C False
u02 C False
T True
u03 T False
T False
Name: baz, dtype: bool
print(~mask)
user comType
u01 C True
C False
C True
u02 C True
T False
u03 T True
T True
Name: baz, dtype: bool
print (df[~mask])
foo bar baz
user comType
u01 C 5 0 3
C 3 5 2
u02 C 4 7 6
u03 T 6 7 7
T 8 1 5
Upvotes: 1
Reputation: 394179
As you already had a boolean mask, you could've just used this rather than drop by inverting using unary not ~
:
In [199]:
mask
Out[199]:
u01 C False
C True
C False
u02 C True
T False
u03 T True
T False
Name: baz, dtype: bool
In [198]:
df[~mask]
Out[198]:
foo bar baz
u01 C 0 5 8
C 5 2 7
u02 T 6 7 3
u03 T 6 5 5
drop
takes label values for dropping, what you passed was a full boolean mask so nothing was dropped as all labels were present, besides what you wanted was the index values but even then it would not have worked:
In [212]:
df.drop(mask[~mask].index)
Out[212]:
foo bar baz
u02 C 3 7 1
because your multi-index labels are repeated, at the first level hence why you end up with a single row, the mask was all you needed
Upvotes: 1