Reputation: 91
I have a pandas df where each row contains a list of words, and the lists have duplicate words that I want to remove.
I tried using dict.fromkeys(listname) in a for loop to iterate over each row of the df, but this splits the words into individual characters.
import pandas as pd

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath, encoding='windows-1252')
df["newlist"] = df["text_lemmatized"]
for i in range(len(df)):
    l = df["text_lemmatized"][i]
    df["newlist"][i] = list(dict.fromkeys(l))
print(df)
Expected result is ==>
['clear', 'pending', 'order', 'pending', 'order'] ['clear', 'pending', 'order']
['pending', 'activation', 'clear', 'pending'] ['pending', 'activation', 'clear']
Actual result is ==>
['clear', 'pending', 'order', 'pending', 'order'] ... [[, ', c, l, e, a, r, ,, , p, n, d, i, g, o, ]]
['pending', 'activation', 'clear', 'pending', ... ... [[, ', p, e, n, d, i, g, ,, , a, c, t, v, o, ...
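The character-by-character output above can be reproduced outside pandas: if a cell actually holds a string that merely looks like a list, dict.fromkeys iterates it one character at a time (a minimal sketch with a hypothetical cell value):

```python
# Hypothetical cell value: a string shaped like a list, which is what
# read_csv typically yields for a list-like column.
cell = "['clear', 'pending', 'order', 'pending', 'order']"

# dict.fromkeys iterates the string character by character, so the
# "unique" elements are characters, not words.
chars = list(dict.fromkeys(cell))
print(chars[:5])  # ['[', "'", 'c', 'l', 'e']
```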
Upvotes: 2
Views: 8684
Reputation: 91
Solution is ==>
import pandas as pd

filepath = "C:/abc5/Python/Clustering/output2.csv"
df = pd.read_csv(filepath, encoding='windows-1252')
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df)
Thanks to jezrael and all the others who helped narrow down this solution.
Upvotes: 0
Reputation: 862691
The problem is that these are not lists but strings, so it is necessary to convert each value to a list with ast.literal_eval; then the values can be converted to sets to remove the duplicates:
import ast
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(ast.literal_eval(x))))
print(df)
text_lemmatized newlist
0 [clear, pending, order, pending, order] [clear, pending, order]
1 [pending, activation, clear, pending] [clear, activation, pending]
Or use dict.fromkeys:
f = lambda x: list(dict.fromkeys(ast.literal_eval(x)))
df['newlist'] = df['text_lemmatized'].map(f)
Another idea is to convert the column text_lemmatized to lists in one step and then remove the duplicates in a second step; the advantage is that text_lemmatized then holds real lists for further processing:
df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
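Putting the two steps together, a self-contained sketch with the question's sample rows inlined as strings (as they would come out of read_csv):

```python
import ast
import pandas as pd

# Sample rows as strings, mimicking what read_csv returns.
df = pd.DataFrame({
    "text_lemmatized": [
        "['clear', 'pending', 'order', 'pending', 'order']",
        "['pending', 'activation', 'clear', 'pending']",
    ]
})

# Step 1: parse each string into a real Python list.
df['text_lemmatized'] = df['text_lemmatized'].map(ast.literal_eval)

# Step 2: drop duplicate words within each row (set order is unspecified).
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
print(df['newlist'].map(set).tolist())
```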
EDIT:
After some discussion, the solution is:
df['newlist'] = df['text_lemmatized'].map(lambda x: list(set(x)))
Upvotes: 0
Reputation: 51
Your code for removing duplicates seems fine. I tried the following and it worked well; I guess the problem is the way you are assigning the list to the dataframe column.
list_from_df = [['clear', 'pending', 'order', 'pending', 'order'],
                ['pending', 'activation', 'clear', 'pending']]
list_with_unique_words = []
for x in list_from_df:
    unique_words = list(dict.fromkeys(x))
    list_with_unique_words.append(unique_words)
print(list_with_unique_words)
Output: [['clear', 'pending', 'order'], ['pending', 'activation', 'clear']]
df["newlist"] = list_with_unique_words
df
Upvotes: 0
Reputation: 25239
Just use Series.map and np.unique.
Your sample data:
Out[43]:
text_lemmatized
0 [clear, pending, order, pending, order]
1 [pending, activation, clear, pending]
df.text_lemmatized.map(np.unique)
Out[44]:
0 [clear, order, pending]
1 [activation, clear, pending]
Name: text_lemmatized, dtype: object
If you prefer the result not be sorted, use pd.unique:
df.text_lemmatized.map(pd.unique)
Out[51]:
0 [clear, pending, order]
1 [pending, activation, clear]
Name: text_lemmatized, dtype: object
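The ordering difference between the two is easy to see on the sample rows (sketch assuming the column already holds real lists):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "text_lemmatized": [
        ["clear", "pending", "order", "pending", "order"],
        ["pending", "activation", "clear", "pending"],
    ]
})

# np.unique sorts the unique words alphabetically...
print(df.text_lemmatized.map(np.unique).map(list)[0])
# ['clear', 'order', 'pending']

# ...while pd.unique keeps them in first-seen order.
print(df.text_lemmatized.map(pd.unique).map(list)[0])
# ['clear', 'pending', 'order']
```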
Upvotes: 4
Reputation: 161
df.drop_duplicates(subset="text_lemmatized", keep="first", inplace=True)
keep="first" means keep the first occurrence.
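Note that drop_duplicates removes duplicate rows, not duplicate words within a row; a sketch with string-valued cells (list-valued cells are unhashable and would raise a TypeError here):

```python
import pandas as pd

df = pd.DataFrame({
    "text_lemmatized": [
        "['clear', 'pending', 'order']",
        "['clear', 'pending', 'order']",    # duplicate row
        "['pending', 'activation', 'clear']",
    ]
})

# Keeps only the first occurrence of each duplicated row.
df.drop_duplicates(subset="text_lemmatized", keep="first", inplace=True)
print(len(df))  # 2
```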
Upvotes: 0
Reputation: 40664
Use set
to remove duplicates.
Also, you don't need the for loop:
df["newlist"] = df["text_lemmatized"].map(lambda x: list(set(x)))
Upvotes: 5