jab

Reputation: 5823

Pandas Drop Very First Duplicate only

Let's say I have the following series.

s = pandas.Series([0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7])

I can select the duplicates that follow the first occurrence of each value (keep='first' leaves the first occurrence unmarked) with the following

s[s.duplicated(keep='first')]

I can select the duplicates that precede the last occurrence of each value (keep='last' leaves the last occurrence unmarked) with the following

s[s.duplicated(keep='last')]
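To make the two masks above concrete, here is a minimal sketch that prints which index labels each `keep` option flags for the example series. The variable names (`after_first`, `before_last`, `all_dups`) are only illustrative. Note that `duplicated` marks the entries it considers repeats, so the occurrence named by `keep` is the one left unmarked.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7])

# keep='first': every occurrence except the first one is flagged.
after_first = s.duplicated(keep='first')
# keep='last': every occurrence except the last one is flagged.
before_last = s.duplicated(keep='last')
# keep=False: every occurrence of a repeated value is flagged.
all_dups = s.duplicated(keep=False)

print(s[after_first].index.tolist())  # [4, 5, 6, 10, 11, 13]
print(s[before_last].index.tolist())  # [3, 4, 5, 9, 10, 12]
print(s[all_dups].index.tolist())     # [3, 4, 5, 6, 9, 10, 11, 12, 13]
```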

However, I'm looking to do the following.

  1. Drop only the very first duplicate, keep the other duplicates of that matching value, but also keep all other duplicates of varying values (including the first ones of each group). In the example above, we'd drop the first 3, but keep the other 3's. Keep all other remaining duplicates.
  2. Keep the first duplicate, drop the other duplicates of that matching value, but also keep all the other duplicates of other varying values. In the example above, we'd keep the first 3, but drop all other 3's. Keep all other remaining duplicates.

I've been racking my brain using cumsum() and diff() to capture the change when a duplicate has been detected. I imagine a solution would involve this, but I can't seem to get a perfect solution. I've gone through too many truth tables right now...

Upvotes: 5

Views: 3946

Answers (2)

Woody Pride

Reputation: 13955

ind = s[s.duplicated()].index[0]

gives you the index label of the first entry that duplicates an earlier one. Use it to drop that entry.

In [45]: s.drop(ind)
Out[45]:
0     0
1     1
2     2
3     3
5     3
6     3
7     4
8     5
9     6
10    6
11    6
12    7
13    7
dtype: int64

For part 2, there must be a neat solution, but the only one I can think of is to create two Series of bools, one indicating where the index does not equal ind and one indicating where the value equals the value at ind, and then combine them with np.logical_xor (with numpy imported as np):

s[np.logical_xor(s.index != ind, s == s.loc[ind])]

Out[95]:
0     0
1     1
2     2
4     3
7     4
8     5
9     6
10    6
11    6
12    7
13    7
dtype: int64
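As an alternative for part 2, here is a rough sketch that uses plain boolean masks instead of XOR. It is not the approach above, just one possible variant; the names `first_dup_value`, `is_first_occurrence` and `part2` are made up for illustration. It assumes the series contains at least one duplicate, otherwise `.iloc[0]` raises an IndexError.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7])

# Value of the first entry that repeats an earlier one (3 in this example).
first_dup_value = s[s.duplicated()].iloc[0]

# True for the first occurrence of every value, duplicated or not.
is_first_occurrence = ~s.duplicated()

# Keep every row with a different value, plus the first occurrence of that value.
part2 = s[(s != first_dup_value) | is_first_occurrence]

print(part2.index.tolist())  # [0, 1, 2, 3, 7, 8, 9, 10, 11, 12, 13]
```

Because the masks are aligned with the series itself, this variant does not depend on the index being the default 0..n-1 range.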

Upvotes: 6

piRSquared

Reputation: 294338

  • duplicated to get dups after the first one
  • duplicated(keep=False) to get all dups including first one
  • xor or ^ to find where it's just the first dup
  • NOTE: This drops the first occurrence of every duplicated value, so the first 6 and the first 7 are dropped along with the first 3

s[~(s.duplicated(keep=False) ^ s.duplicated())]

0     0
1     1
2     2
4     3
5     3
6     3
7     4
8     5
10    6
11    6
13    7
dtype: int64
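If only the very first of those first occurrences should go (part 1 of the question), one possible refinement is to combine the XOR mask with cumsum(), as the question hints at. This is a sketch under that assumption, not part of the answer above; `first_of_each`, `very_first` and `part1` are illustrative names.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7])

# True at the first occurrence of each duplicated value (labels 3, 9, 12 here).
first_of_each = s.duplicated(keep=False) ^ s.duplicated()

# cumsum() counts the True flags seen so far; the count is 1 from the first
# flag until the second, so combining it with the mask isolates the first flag.
very_first = first_of_each & (first_of_each.cumsum() == 1)

part1 = s[~very_first]

print(part1.index.tolist())  # [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
```

If the series has no duplicates, `very_first` is all False and the series comes back unchanged.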

Upvotes: 4
