Reputation: 679
I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:
Index x y
1 1 ec, us, us, gbr, lst
2 5 ec, us, us, us, us, ec, ec, ec, ec
3 8 ec, us, us, gbr, lst, lst, lst, lst, gbr
4 5 ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
5 7 chn, chn, chn, ec, ec, us, us, gbr, lst
I need to eliminate all the duplicate items an get a resulting dataframe like this:
Index x y
1 1 ec, us, gbr, lst
2 5 ec, us
3 8 ec, us, gbr,lst
4 5 ec, us, ir
5 7 chn, ec, us, gbr, lst
Thanks!!
Upvotes: 12
Views: 6056
Reputation: 30605
Split
and apply set
and join
i.e
df['y'].str.split(', ').apply(set).str.join(', ')
0 us, ec, gbr, lst
1 us, ec
2 us, ec, gbr, lst
3 us, ec, ir
4 us, lst, ec, gbr, chn
Name: y, dtype: object
Update based on comment :
df['y'].str.replace('nan|[{}\s]','', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(",{2,}",",", regex=True)
# Replace all the braces and nan with `''`, then split and apply set and join
Upvotes: 21
Reputation: 7131
If you don't care about item order, and assuming the data type of everything in column y
is a string, you can use the following snippet:
df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))
The set()
conversion is what removes duplicates. I think in later versions of python it might preserve order (3.4+ maybe?), but that is an implementation detail rather than a language specification.
Upvotes: 1
Reputation: 10141
use the apply
method on the dataframe.
# change this function according to your needs
def dedup(row):
return list(set(row.y))
df['deduped'] = df.apply(dedup, axis=1)
Upvotes: 0
Reputation: 3723
Try this:
d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))
Upvotes: 1