How does pandas.shift really work?

Question

I want to delete adjacent duplicate rows in a DataFrame. I was trying to do this with df[df.shift() != df].dropna().reset_index(drop=True), but shift() is not behaving the way I expected.

Look at the following example

In [11]: df
Out[11]: 
   x  y
0  a  1
1  b  2
2  b  2
3  e  4
4  e  5
5  f  6
6  g  7
7  h  8

df.x[3] equals df.x[4], but the y values in those rows are different. Even so, the output is the following:

In [13]: df[df.shift() != df]
Out[13]: 
     x   y
0    a   1
1    b   2
2  NaN NaN
3    e   4
4  NaN   5
5    f   6
6    g   7
7    h   8

I want to delete a row only if it is a complete duplicate of the row before it, not if it merely shares some values with it. Any ideas?

DSM · Accepted Answer

Well, look at df.shift() != df:

>>> df.shift() != df
       x      y
0   True   True
1   True   True
2  False  False
3   True   True
4  False   True
5   True   True
6   True   True
7   True   True

This is a 2D boolean frame, not a 1D Series, so when you use it as a mask on a frame, pandas works cell by cell: it keeps the cells where the mask is True and puts NaN where it is False. It sounds like you want to keep the rows where either value is True (that is, where any value is True), which is a 1D object:

>>> (df.shift() != df).any(axis=1)
0     True
1     True
2    False
3     True
4     True
5     True
6     True
7     True
dtype: bool
>>> df[(df.shift() != df).any(axis=1)]
   x  y
0  a  1
1  b  2
3  e  4
4  e  5
5  f  6
6  g  7
7  h  8
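
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the frame from the question by hand; the variable names `prev`, `changed` and `deduped` are only illustrative, and the final reset_index(drop=True) is taken from the question's original attempt.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "x": list("abbeefgh"),
    "y": [1, 2, 2, 4, 5, 6, 7, 8],
})

# shift() moves every row down by one position, so row i of `prev`
# holds row i-1 of `df`; the first row has no predecessor and is NaN.
prev = df.shift()

# Cell-wise comparison gives a 2D boolean frame; any(axis=1) collapses
# it to one boolean per row: True if the row differs from the previous
# row in at least one column.
changed = (prev != df).any(axis=1)

# A 1D boolean mask selects whole rows, so nothing is turned into NaN.
deduped = df[changed].reset_index(drop=True)
print(deduped)
```

One caveat: NaN != NaN evaluates to True, so two adjacent rows that are identical but contain NaN would both be kept by this comparison.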
