AllanLRH
AllanLRH

Reputation: 1204

Deleting rows not uniquely identified by the index in Pandas

How do I delete (drop) rows from a Pandas DataFrame, in which the rows are not uniquely identified by the index?
To clarify what I mean by uniquely identified: When querying into the index, multiple values are returned.

I can make a mask which, when used as an argument to df.loc, correctly returns the targeted rows, but I can't make it work with df.drop.

Example code:

The output of the print functions is included as a comment.

from __future__ import print_function
import numpy as np
import pandas as pd


np.random.seed(0)
df = pd.DataFrame(
    data=np.random.randint(0, 10, size=(7, 3)),
    index=[['u01', 'u01', 'u01', 'u02', 'u02', 'u03', 'u03'], ['C', 'C', 'C', 'C', 'T', 'T', 'T']],
    columns=['foo', 'bar', 'baz'])
df.index.names = ['user', 'comType']

print(df)
#                  foo  bar  baz
#    user comType
#    u01  C          5    0    3
#         C          3    7    9
#         C          3    5    2
#    u02  C          4    7    6
#         T          8    8    1
#    u03  T          6    7    7
#         T          8    1    5

mask = (df.baz > 8) | (df.baz < 2)
print(mask)
#    user  comType
#    u01   C          False
#          C           True
#          C          False
#    u02   C          False
#          T           True
#    u03   T          False
#          T          False
#    Name: baz, dtype: bool


print(df.loc[mask])
#                  foo  bar  baz
#    user comType
#    u01  C          3    7    9
#    u02  T          8    8    1

df2 = df.drop(mask.index[mask.values])  # Drops all rows of user user u01, one row of user u02
print(df2)
#                  foo  bar  baz
#    user comType
#    u02  C          4    7    6
#    u03  T          6    7    7
#         T          8    1    5

df3 = df.drop(mask)  # Doesn't do anything
print(df3)
#                  foo  bar  baz
#    user comType
#    u01  C          5    0    3
#         C          3    7    9
#         C          3    5    2
#    u02  C          4    7    6
#         T          8    8    1
#    u03  T          6    7    7
#         T          8    1    5



A quick tip: If you want to run the code on your own machine, I suggest replace the print-functions with these lines, as the output otherwise is hard to discern form one another:

print('df', df, sep=" =\n\n", end="\n"*5)
print('mask', mask, sep=" =\n\n", end="\n"*5)
print('df.loc[mask]', df.loc[mask], sep=" =\n\n", end="\n"*5)
print('df2', df2, sep=" =\n\n", end="\n"*5)
print('df3', df3, sep=" =\n\n", end="\n"*5)

Upvotes: 1

Views: 145

Answers (2)

jezrael
jezrael

Reputation: 863226

Use inverting boolean mask by ~ with boolean indexing:

mask = (df.baz > 8) | (df.baz < 2)
print (mask)
user  comType
u01   C          False
      C           True
      C          False
u02   C          False
      T           True
u03   T          False
      T          False
Name: baz, dtype: bool

print(~mask)
user  comType
u01   C           True
      C          False
      C           True
u02   C           True
      T          False
u03   T           True
      T           True
Name: baz, dtype: bool

print (df[~mask])
              foo  bar  baz
user comType               
u01  C          5    0    3
     C          3    5    2
u02  C          4    7    6
u03  T          6    7    7
     T          8    1    5

Upvotes: 1

EdChum
EdChum

Reputation: 394179

As you already had a boolean mask, you could've just used this rather than drop by inverting using unary not ~:

In [199]:
mask

Out[199]:
u01  C    False
     C     True
     C    False
u02  C     True
     T    False
u03  T     True
     T    False
Name: baz, dtype: bool

In [198]:    
df[~mask]

Out[198]:
       foo  bar  baz
u01 C    0    5    8
    C    5    2    7
u02 T    6    7    3
u03 T    6    5    5

drop takes label values for dropping, what you passed was a full boolean mask so nothing was dropped as all labels were present, besides what you wanted was the index values but even then it would not have worked:

In [212]:
df.drop(mask[~mask].index)

Out[212]:
       foo  bar  baz
u02 C    3    7    1

because your multi-index labels are repeated, at the first level hence why you end up with a single row, the mask was all you needed

Upvotes: 1

Related Questions