Reputation: 691
I have a Pandas Series:
increased 1.691759
increased 1.601759
reports 1.881759
reports 1.491759
president 1.386294
president 1.791759
exclusive 1.381759
exclusive 1.291759
bank 1.386294
bank 1.791759
........ ........
........ .......
I just want to remove duplicate words from the series, retaining the word with the higher numeric value. So, the expected output:
increased 1.691759
reports 1.881759
president 1.791759
exclusive 1.381759
bank 1.791759
........ ........
........ .......
I have tried converting the series into a pandas DataFrame and it works fine. But it would be a time-consuming process as I have a large series, so I want to do the processing on the existing series only.
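For reference, a Series with this shape (the words as the index, the scores as the values) can be built as follows; this is just a sketch reconstructing the sample data above, not code from the question:
import pandas as pd
s = pd.Series([1.691759, 1.601759, 1.881759, 1.491759, 1.386294,
               1.791759, 1.381759, 1.291759, 1.386294, 1.791759],
              index=['increased', 'increased', 'reports', 'reports', 'president',
                     'president', 'exclusive', 'exclusive', 'bank', 'bank'])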
Upvotes: 2
Views: 763
Reputation: 51335
You can use drop_duplicates after you sort by col2. drop_duplicates keeps the first row by default, so if you sort by col2 so that the largest value is first, it will keep the largest:
df.sort_values('col2', ascending=False).drop_duplicates('col1')
col1 col2
2 reports 1.881759
5 president 1.791759
9 bank 1.791759
0 increased 1.691759
6 exclusive 1.381759
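A small variation, not in the original answer: drop_duplicates also accepts keep='last', so sorting ascending and keeping the last row per word gives the same result:
df.sort_values('col2').drop_duplicates('col1', keep='last')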
Alternative using groupby and tail:
Another way would be to do this:
df.sort_values('col2').groupby('col1').tail(1)
col1 col2
6 exclusive 1.381759
0 increased 1.691759
5 president 1.791759
9 bank 1.791759
2 reports 1.881759
Edit: Based on your comment, to convert to a series for further use you can do:
df.sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']
col1
reports 1.881759
president 1.791759
bank 1.791759
increased 1.691759
exclusive 1.381759
Name: col2, dtype: float64
Or do a groupby directly on the series (but this is slower, see benchmarks):
s.sort_values().groupby(level=0).tail(1)
Benchmark
I tested this with a Series of length 1000000, and even with transforming it to a DataFrame and back to a Series, it takes less than a second. You might be able to find a faster way without transforming, but this isn't so bad IMO.
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': np.random.choice(['increased', 'reports', 'president', 'exclusive', 'bank'], 1000000), 'col2': np.random.randn(1000000)})
s = pd.Series(df.set_index('col1').col2)
>>> s.head()
col1
president 0.600691
increased 1.752238
president -1.409425
bank 0.349149
reports 0.596207
Name: col2, dtype: float64
>>> len(s)
1000000
import timeit
def test(s=s):
    return s.to_frame().reset_index().sort_values('col2', ascending=False).drop_duplicates('col1').set_index('col1')['col2']
>>> timeit.timeit(test, number=10) / 10
0.685569432300008
Applying groupby directly on a Series is slower:
def gb_test(s=s):
    return s.sort_values().groupby(level=0).tail(1)
>>> timeit.timeit(gb_test, number=10) / 10
0.7673859989999983
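As a quick sanity check (a sketch, not part of the original benchmark), both functions should return the same maximum per word once the results are sorted by their index:
pd.testing.assert_series_equal(test().sort_index(), gb_test().sort_index())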
Upvotes: 5
Reputation: 652
I'm not sure if this method will work on a Pandas DataFrame, but you can try using the set() function. The set() function removes all duplicates.
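For context, here is roughly what set() gives you on such a Series, assuming the words are its index (a sketch; note that this only deduplicates the labels and discards the numeric values, so it does not by itself keep the highest score):
unique_words = set(s.index)  # e.g. {'increased', 'reports', 'president', 'exclusive', 'bank'}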
Upvotes: 0