user3734568

Reputation: 1461

remove duplicate word from pandas column

I have a dataframe with values like the following stored in one column:

>>> Results.Category[:5]
0    issue delivery wrong master account
1      data wrong master account batch
2    order delivery wrong data account
3    issue delivery wrong master account
4    delivery wrong master account batch
Name: Category, dtype: object

Now I want each word to appear only once in the Category column, in the first row where it occurs. For example, the word "wrong" is present in the first row, so I want to keep it there and remove it from all the remaining rows. Likewise, the word "data" first appears in the second row, so I want to keep it in the second row only and remove it from the rows after it.

I found that duplicates within a single row can be removed as shown below, but I need to remove duplicate words across the whole column. Can anyone please help me here?

AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: remove_dup(x))

Upvotes: 1

Views: 1910

Answers (4)

Serge Ballesta

Reputation: 148870

What you want cannot be vectorized, so let us just forget about pandas and use a Python set:

total = set()
result = []
for line in AFResults['FinalCategory']:
    line = set(line.split()).difference(total)
    total = total.union(line)
    result.append(' '.join(line))

You get a list like this (sets are unordered, so the word order within each entry may vary): ['wrong issue master delivery account', 'batch data', 'order', '', '']

You can use it to populate a dataframe column:

AFResults['FinalCategoryN'] = result
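
For experimenting, the loop above can be wrapped in a small helper and run on the sample rows from the question. This is only a sketch: the name first_occurrence_only is made up here, and the words are sorted purely so that the output is reproducible, since a set has no defined order.

```python
def first_occurrence_only(lines):
    """Keep each word only in the first line where it appears."""
    total = set()    # every word seen in earlier lines
    result = []
    for line in lines:
        new_words = set(line.split()) - total    # words not seen before
        total |= new_words
        # sorted() only makes the output deterministic; sets are unordered
        result.append(' '.join(sorted(new_words)))
    return result


# Sample rows copied from the question
sample = [
    'issue delivery wrong master account',
    'data wrong master account batch',
    'order delivery wrong data account',
    'issue delivery wrong master account',
    'delivery wrong master account batch',
]
result = first_occurrence_only(sample)
print(result)
# ['account delivery issue master wrong', 'batch data', 'order', '', '']
```

The returned list has one entry per input row, so it can be assigned to a dataframe column in the same way as above.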

Upvotes: 1

cs95

Reputation: 402263

It seems you want something like this:

out = []
seen = set()
for c in df['Category']:
    words = c.split()
    out.append(' '.join([w for w in words if w not in seen]))
    seen.update(words)

df['FinalCategoryN'] = out
df

                              Category                       FinalCategoryN
0  issue delivery wrong master account  issue delivery wrong master account
1      data wrong master account batch                           data batch
2    order delivery wrong data account                                order
3  issue delivery wrong master account                                     
4  delivery wrong master account batch                                     

If you don't care about the ordering, you can use set logic:

u = df['Category'].apply(str.split)
v = u.shift().map(lambda x: [] if x != x else x).cumsum().map(set)
(u.map(set) - v).str.join(' ')

0    account delivery issue master wrong
1                             batch data
2                                  order
3                                       
4                                       
Name: Category, dtype: object

Upvotes: 3

BENY

Reputation: 323226

In your case, you need to split the strings first and then remove the duplicates with drop_duplicates:

df.c.str.split(expand=True).stack().drop_duplicates().\
     groupby(level=0).apply(','.join).reindex(df.index)
Out[206]: 
0    issue,delivery,wrong,master,account
1                             data,batch
2                                  order
3                                    NaN
4                                    NaN
dtype: object
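
The same chain can be tried end to end on the question's data. The following is a rough sketch under a few assumptions: the column is called Category as in the question, the words are joined with a space instead of a comma, and rows left without any new word are filled with an empty string. The dropna() call is an addition that discards the padding values stack() may keep when rows have different word counts.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({'Category': [
    'issue delivery wrong master account',
    'data wrong master account batch',
    'order delivery wrong data account',
    'issue delivery wrong master account',
    'delivery wrong master account batch',
]})

# One word per entry, indexed by (row, position); drop padding values
words = df['Category'].str.split(expand=True).stack().dropna()

# Keep only the first occurrence of every word across the whole column
unique_words = words.drop_duplicates()

# Glue the surviving words back together per original row
df['FinalCategoryN'] = (
    unique_words.groupby(level=0).apply(' '.join)
    .reindex(df.index)    # rows with no surviving words come back as NaN
    .fillna('')
)
print(df['FinalCategoryN'].tolist())
# ['issue delivery wrong master account', 'data batch', 'order', '', '']
```

Unlike the set-based variants, drop_duplicates keeps the first occurrence, so the original word order within each row is preserved.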

Upvotes: 2

U13-Forward

Reputation: 71560

To remove duplicates within each row, use apply with sorted, set, str.join and str.index:

AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: ' '.join(sorted(set(x.split()), key=x.index)))
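
Note that x.index on a string looks up substring positions rather than word positions, so the ordering can go wrong when one word occurs inside an earlier word. A possible alternative for within-row deduplication, sketched here with a made-up helper name dedupe_words, relies on dict keys being unique and insertion-ordered. Like the one-liner above, it only handles repeats inside a single row, not across rows.

```python
def dedupe_words(text):
    """Drop repeated words within one string, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return ' '.join(dict.fromkeys(text.split()))


print(dedupe_words('wrong data wrong account data'))  # wrong data account
print(dedupe_words('abc c b c'))                      # abc c b
```

It could be plugged into the column the same way, for example AFResults['FinalCategory'].apply(dedupe_words).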

Upvotes: 0
