PAstudilloE
PAstudilloE

Reputation: 679

Remove duplicates from rows and columns (cell) in a dataframe, python

I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:

Index   x    y  
  1     1    ec, us, us, gbr, lst
  2     5    ec, us, us, us, us, ec, ec, ec, ec
  3     8    ec, us, us, gbr, lst, lst, lst, lst, gbr
  4     5    ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
  5     7    chn, chn, chn, ec, ec, us, us, gbr, lst

I need to eliminate all the duplicate items an get a resulting dataframe like this:

Index   x    y  
  1     1    ec, us, gbr, lst
  2     5    ec, us
  3     8    ec, us, gbr,lst
  4     5    ec, us, ir
  5     7    chn, ec, us, gbr, lst

Thanks!!

Upvotes: 12

Views: 6056

Answers (4)

Bharath M Shetty
Bharath M Shetty

Reputation: 30605

Split and apply set and join i.e

df['y'].str.split(', ').apply(set).str.join(', ')

0         us, ec, gbr, lst
1                   us, ec
2         us, ec, gbr, lst
3               us, ec, ir
4    us, lst, ec, gbr, chn
Name: y, dtype: object

Update based on comment :

df['y'].str.replace('nan|[{}\s]','', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(",{2,}",",", regex=True)

# Replace all the braces and nan with `''`, then split and apply set and join

Upvotes: 21

Hans Musgrave
Hans Musgrave

Reputation: 7131

If you don't care about item order, and assuming the data type of everything in column y is a string, you can use the following snippet:

df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))

The set() conversion is what removes duplicates. I think in later versions of python it might preserve order (3.4+ maybe?), but that is an implementation detail rather than a language specification.

Upvotes: 1

srj
srj

Reputation: 10141

use the apply method on the dataframe.

# change this function according to your needs
def dedup(row):
    return list(set(row.y))

df['deduped'] = df.apply(dedup, axis=1)

Upvotes: 0

koPytok
koPytok

Reputation: 3723

Try this:

d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))

Upvotes: 1

Related Questions