Reputation: 1461
I have a dataframe with information like the below stored in one column:
>>> Results.Category[:5]
0 issue delivery wrong master account
1 data wrong master account batch
2 order delivery wrong data account
3 issue delivery wrong master account
4 delivery wrong master account batch
Name: Category, dtype: object
Now I want each word to appear only once in the Category column, in the first row where it occurs. For example: the word "wrong" is present in the first row, so I want to remove it from all the remaining rows and keep it in the first row only. The word "data" first appears in the second row, so I want to remove it from all the remaining rows and keep it in the second row only.
I found that duplicates within a row can be removed using the code below, but I need to remove duplicate words across the whole column. Can anyone please help me here?
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: remove_dup(x))
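The `remove_dup` helper is not shown in the question. A minimal sketch of what such a per-row helper might look like, assuming it only drops repeated words within a single string and keeps the first occurrence (the body below is hypothetical, not the original function):

```python
def remove_dup(text):
    """Hypothetical helper: drop repeated words within one string, keeping first occurrences."""
    # dict.fromkeys keeps insertion order (Python 3.7+), so the word order is preserved
    return ' '.join(dict.fromkeys(text.split()))


print(remove_dup('wrong data wrong account data'))  # wrong data account
```

A helper like this works on one row at a time, which is why it cannot remove words that were already seen in earlier rows.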
Upvotes: 1
Views: 1910
Reputation: 148870
What you want cannot be vectorized, so let us just forget about pandas and use a Python set:
total = set()
result = []
for line in AFResults['FinalCategory']:
    line = set(line.split()).difference(total)
    total = total.union(line)
    result.append(' '.join(line))
You get this list (the word order within each entry may vary, since sets are unordered): ['wrong issue master delivery account', 'batch data', 'order', '', '']
You can use it to populate a dataframe column:
AFResults['FinalCategoryN'] = result
Upvotes: 1
Reputation: 402263
It seems you want something like,
out = []
seen = set()
for c in df['Category']:
    words = c.split()
    out.append(' '.join([w for w in words if w not in seen]))
    seen.update(words)
df['FinalCategoryN'] = out
df
                              Category                       FinalCategoryN
0  issue delivery wrong master account  issue delivery wrong master account
1      data wrong master account batch                           data batch
2    order delivery wrong data account                                order
3  issue delivery wrong master account
4  delivery wrong master account batch
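The loop above can also be wrapped in a small reusable function. The sketch below is one possible variant, not part of the original answer: the name `keep_first_occurrence` is made up for illustration, and it additionally assumes that a word repeated within the same row should be kept only once and that non-string entries (such as NaN) should become empty strings.

```python
def keep_first_occurrence(rows):
    """Return one string per row, keeping each word only where it first appears."""
    seen = set()
    out = []
    for row in rows:
        # Assumption: missing / non-string entries are treated as empty rows
        words = row.split() if isinstance(row, str) else []
        fresh = []
        for w in words:
            if w not in seen:
                seen.add(w)  # mark immediately so repeats in the same row are dropped too
                fresh.append(w)
        out.append(' '.join(fresh))
    return out


sample = [
    'issue delivery wrong master account',
    'data wrong master account batch',
    'order delivery wrong data account',
    'issue delivery wrong master account',
    'delivery wrong master account batch',
]
print(keep_first_occurrence(sample))
# ['issue delivery wrong master account', 'data batch', 'order', '', '']
```

With a DataFrame, this could be used as `df['FinalCategoryN'] = keep_first_occurrence(df['Category'])`.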
If you don't care about the ordering, you can use set logic:
u = df['Category'].apply(str.split)
v = u.shift().map(lambda x: [] if x != x else x).cumsum().map(set)
(u.map(set) - v).str.join(' ')
0 account delivery issue master wrong
1 batch data
2 order
3
4
Name: Category, dtype: object
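For comparison, the same "everything seen in earlier rows" idea behind the set logic above can be sketched in plain Python with `itertools.accumulate`. This is only an illustrative equivalent, not the answer's code: the variable names are made up, the rows are the sample data from the question, and the words are sorted purely to make the output deterministic. The `initial` argument requires Python 3.8 or later.

```python
from itertools import accumulate

rows = [
    'issue delivery wrong master account',
    'data wrong master account batch',
    'order delivery wrong data account',
    'issue delivery wrong master account',
    'delivery wrong master account batch',
]

word_sets = [set(r.split()) for r in rows]

# seen_before[i] is the union of the word sets of rows 0..i-1 (empty for row 0)
seen_before = list(accumulate(word_sets[:-1], set.union, initial=set()))

# Keep only the words of each row that did not appear in any earlier row
unique = [' '.join(sorted(ws - sb)) for ws, sb in zip(word_sets, seen_before)]
print(unique)
# ['account delivery issue master wrong', 'batch data', 'order', '', '']
```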
Upvotes: 3
Reputation: 323226
In your case you need to split it first, then remove the duplicates with drop_duplicates:
df.c.str.split(expand=True).stack().drop_duplicates().\
groupby(level=0).apply(','.join).reindex(df.index)
Out[206]:
0 issue,delivery,wrong,master,account
1 data,batch
2 order
3 NaN
4 NaN
dtype: object
Upvotes: 2
Reputation: 71560
Use apply with sorted, set, str.join and str.index (note that this removes duplicates within each row only):
AFResults['FinalCategoryN'] = AFResults['FinalCategory'].apply(lambda x: ' '.join(sorted(set(x.split()), key=x.index)))
Upvotes: 0