Learner

Reputation: 691

Remove duplicates from pandas series based on condition

I have a Pandas series as:

    increased   1.691759
    increased   1.601759
    reports     1.881759
    reports     1.491759
    president   1.386294
    president   1.791759
    exclusive   1.381759
    exclusive   1.291759
    bank        1.386294
    bank        1.791759
    ........    ........
    ........    .......

I just want to remove the duplicate words from the series, retaining the word with the higher numeric value. So, the expected output is:

increased   1.691759
reports     1.881759
president   1.791759
exclusive   1.381759
bank        1.791759
........    ........
........    .......

I have tried it by converting the series into a pandas DataFrame and it works fine. But it would be a time-consuming process as I have a large series, so I want to process it within the existing series only.
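For context, the DataFrame round-trip I tried looks roughly like this (a minimal sketch on the sample data above; the column names word and score are just for illustration):

    import pandas as pd

    # example series shaped like the data above: words in the index, scores as values
    s = pd.Series(
        [1.691759, 1.601759, 1.881759, 1.491759, 1.386294, 1.791759],
        index=['increased', 'increased', 'reports', 'reports', 'president', 'president'],
    )

    # round-trip through a DataFrame and keep the higher value per word
    df = s.rename('score').reset_index().rename(columns={'index': 'word'})
    deduped = df.groupby('word')['score'].max()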

Upvotes: 2

Views: 763

Answers (2)

sacuL

Reputation: 51335

You can use drop_duplicates after sorting by col2. drop_duplicates keeps the first occurrence by default, so if you sort by col2 so that the largest value comes first, it will keep the largest:

df.sort_values('col2', ascending=False).drop_duplicates('col1')

        col1      col2
2    reports  1.881759
5  president  1.791759
9       bank  1.791759
0  increased  1.691759
6  exclusive  1.381759
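(Here df is assumed to be your series converted to a two-column frame; a minimal sketch of that conversion, mirroring the one used in the benchmark below:)

    # build the two-column frame this answer works on from the original series;
    # col1/col2 are just the column names assumed here
    df = s.to_frame('col2').reset_index().rename(columns={'index': 'col1'})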

Alternative using groupby and tail:

Another way would be to do this:

df.sort_values('col2').groupby('col1').tail(1)

        col1      col2
6  exclusive  1.381759
0  increased  1.691759
5  president  1.791759
9       bank  1.791759
2    reports  1.881759

Edit: Based on your comment, to convert to a series for further use you can do:

df.sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']

col1
reports      1.881759
president    1.791759
bank         1.791759
increased    1.691759
exclusive    1.381759
Name: col2, dtype: float64

Or do a groupby directly on the series (but this is slower, see benchmarks):

s.sort_values().groupby(s.index).tail(1)
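As a quick illustration with a tiny stand-in series (the variable name demo is made up here):

    demo = pd.Series(
        [1.691759, 1.601759, 1.881759, 1.491759],
        index=['increased', 'increased', 'reports', 'reports'],
    )
    # ascending sort puts the larger value for each word last; tail(1) keeps it
    demo.sort_values().groupby(demo.index).tail(1)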

Benchmark

I tested this with a Series of length 1,000,000, and even with transforming it to a DataFrame and back to a Series, it takes less than a second. You might be able to find a faster way without transforming, but this isn't so bad IMO.

df = pd.DataFrame({'col1':np.random.choice(['increased', 'reports', 'president', 'exclusive', 'bank'], 1000000), 'col2':np.random.randn(1000000)})

s = pd.Series(df.set_index('col1').col2)

>>> s.head()
col1
president    0.600691
increased    1.752238
president   -1.409425
bank         0.349149
reports      0.596207
Name: col2, dtype: float64
>>> len(s)
1000000

import timeit

def test(s = s):
    return s.to_frame().reset_index().sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']

>>> timeit.timeit(test, number=10) / 10
0.685569432300008

Applying groupby directly on a Series is slower:

def gb_test(s=s):
    return s.sort_values().groupby(s.index).tail(1)

>>> timeit.timeit(gb_test, number=10) / 10
0.7673859989999983

Upvotes: 5

Ely Fialkoff

Reputation: 652

I'm not sure if this method will work on a Pandas DataFrame, but you can try using the set() function. The set() function removes all duplicates.
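A minimal sketch of using set() here (assuming the words are the series index); note that a plain set keeps only the unique words themselves:

    # set() collapses the duplicate word labels into unique entries,
    # but a plain set holds only the words, not their numeric values
    unique_words = set(s.index)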

Upvotes: 0
