Pandas does not raise KeyError for missing column with .drop_duplicates()

Question

Pandas just did something that makes me trust it a bit less. Does anyone know why it behaves like this? In this small example the problem is easy to see, but with a larger dataframe one would need to take care. I almost made a mistake because of it.

import pandas as pd

df = pd.DataFrame({"A":[34,12,78,84,26], "B":[54,87,35,81,87], "C":[56,78,0,14,13], "D":[0,87,72,87,14], "E":[78,12,31,0,34]})
df

Then, if you look for a column which isn't there:

df['b']
KeyError: 'b'

But -

df.drop_duplicates(['b', 'D'])

...runs without error and finds the duplicate in column D.

Actually, df.drop_duplicates(['D']) produces exactly the same result.

However, it has missed the duplicate in column B because the column name was misspelled. It doesn't warn you or raise an error.

Using Pandas 0.22.0 and Python 3.6.4.

df.drop_duplicates(['B','D']) just returns the original dataframe without dropping anything. Am I missing something or is Pandas broken?

CezarySzulc · Accepted Answer

I'm using Pandas 0.20.3 and Python 3.6.

When I run this line of code:

df.drop_duplicates(['b', 'D'])

I get:

KeyError: 'b'
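
If you would rather not depend on what a given Pandas version does with an unknown label, you can check the subset yourself before dropping. This is only a minimal sketch using the dataframe from the question; safe_drop_duplicates is a hypothetical helper name, not part of Pandas:

```python
import pandas as pd

df = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 81, 87],
                   "C": [56, 78, 0, 14, 13], "D": [0, 87, 72, 87, 14],
                   "E": [78, 12, 31, 0, 34]})


def safe_drop_duplicates(frame, subset):
    """Hypothetical helper: fail loudly if any subset label is not a column."""
    missing = [col for col in subset if col not in frame.columns]
    if missing:
        raise KeyError(f"columns not in DataFrame: {missing}")
    return frame.drop_duplicates(subset=subset)


# Valid label: behaves like a plain drop_duplicates on column D.
deduped = safe_drop_duplicates(df, ["D"])

# Misspelled label: the helper raises instead of silently ignoring 'b'.
try:
    safe_drop_duplicates(df, ["b", "D"])
    raised = False
except KeyError:
    raised = True
```

With this guard a typo such as 'b' instead of 'B' is reported before any rows are dropped, whatever the installed version does on its own.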

There is a strange situation with row 4 in your example.

First:

df.loc[4,'B'] = 87

After dropping duplicates:

df.loc[4,'B'] = 82

It looks like you performed some extra operation between these steps.
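
As for df.drop_duplicates(['B','D']) returning everything: as far as I know, a subset with several columns is compared as a combination, so a row is only dropped when both B and D match an earlier row. A quick way to see this on the data from the question is df.duplicated (the variable names below are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [34, 12, 78, 84, 26], "B": [54, 87, 35, 81, 87],
                   "C": [56, 78, 0, 14, 13], "D": [0, 87, 72, 87, 14],
                   "E": [78, 12, 31, 0, 34]})

# Each mask marks rows that repeat an earlier row on the given columns.
dup_on_d = df.duplicated(subset=["D"])          # D alone
dup_on_b = df.duplicated(subset=["B"])          # B alone
dup_on_pair = df.duplicated(subset=["B", "D"])  # the (B, D) pair together

print(pd.DataFrame({"D": dup_on_d, "B": dup_on_b, "B+D": dup_on_pair}))
```

Row 3 repeats a value in D and row 4 repeats a value in B, but no (B, D) pair occurs twice, so nothing is dropped when both columns are given.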
