Reputation: 519
I am working with a CSV file in which many rows contain duplicated words. I want to remove the duplicates without changing the order of the remaining words.
Example CSV file (userID and description are the column names):
userID, description
12, hello world hello world
13, I will keep the 2000 followers same I will keep the 2000 followers same
14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car
.
.
I would like to have the output as:
userID, description
12, hello world
13, I will keep the 2000 followers same
14, I paid $2000 to the car
.
.
I have already tried the solutions from other posts (such as 1, 2, 3), but none of them changed anything in my output. The order in the output file matters, so I don't want to lose it. It would be great if you could provide a code sample that I can run on my side and learn from. Thank you.
[I am using Python 3.7]
Upvotes: 2
Views: 894
Reputation: 926
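To remove duplicates, I'd suggest a solution using the OrderedDict data structure:
from collections import OrderedDict

# Split each string into words, keep the first occurrence of each word, then rejoin.
df['Desired'] = (df['Current'].str.split()
                 .apply(lambda x: list(OrderedDict.fromkeys(x)))
                 .str.join(' '))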
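As a minimal, self-contained sketch of the same idea on plain strings (the helper name `dedupe_words` is hypothetical, not part of the answer above): since Python 3.7 a regular `dict` preserves insertion order, so `dict.fromkeys` can stand in for `OrderedDict.fromkeys`. Note that this removes every repeated word, not only whole repeated phrases, so a sentence that legitimately uses a word twice loses the second use.

```python
def dedupe_words(text):
    """Keep only the first occurrence of each whitespace-separated word."""
    # dict.fromkeys keeps keys in insertion order (guaranteed since Python 3.7).
    return " ".join(dict.fromkeys(text.split()))


rows = [
    "hello world hello world",
    "I will keep the 2000 followers same I will keep the 2000 followers same",
    "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car",
]
cleaned = [dedupe_words(row) for row in rows]
for line in cleaned:
    print(line)

# Caveat: a word that is legitimately used twice is also dropped.
print(dedupe_words("the cat and the dog"))
```

With pandas, the same helper could presumably be applied per row with `df['description'].apply(dedupe_words)`.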
Upvotes: 2
Reputation: 2022
Answer taken from How can I tell if a string repeats itself in Python?
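import pandas as pd

def principal_period(s):
    # Strip stray whitespace, then add one trailing space so that
    # space-separated repeats line up exactly.
    s = s.strip() + ' '
    i = (s + s).find(s, 1, -1)
    # No repetition found: return the text unchanged instead of None.
    return s.strip() if i == -1 else s[:i].strip()

# skipinitialspace drops the space that follows each comma in the sample file.
df = pd.read_csv(r'path\to\filename_in.csv', skipinitialspace=True)
# apply returns a new Series, so assign it back to the column.
df['description'] = df['description'].apply(principal_period)
df.to_csv(r'output\path\filename_out.csv', index=False)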
Explanation:
I add a space at the end because the repeated strings are separated by spaces. The code then concatenates the string with itself and searches for the original string inside the result, skipping the first and last characters: skipping the first avoids the trivial match at position 0, and skipping the last avoids the trivial match at the midpoint when nothing repeats. The position found is where the second occurrence starts, which is also where the shortest repeating unit ends. That unit is what gets returned.
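To make the search step concrete, here is a short trace of the values involved, written as a sketch on two hand-picked strings (the variable names are illustrative only):

```python
# Repeating case: "hello world" appears twice, separated by a space.
s = "hello world hello world" + " "   # trailing space added, as described above
doubled = s + s
i = doubled.find(s, 1, -1)            # search, skipping first and last characters
unit = s[:i]                          # shortest repeating unit, trailing space included
print(i, repr(unit))

# Non-repeating case: the only other match would need the last character,
# which the search excludes, so find() reports -1.
t = "no repeats here" + " "
j = (t + t).find(t, 1, -1)
print(j)
```

Here `i` is the length of the repeating unit including its trailing space, so stripping `unit` gives the cleaned text.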
Upvotes: 0
Reputation: 150785
Solution taken from here:
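def principal_period(s):
    # Find where s first reappears inside s + s; that index is the period length.
    # Note: this only lines up if each value carries a leading (or trailing)
    # space, as in the sample CSV where a space follows each comma.
    i = (s + s).find(s, 1)
    return s[:i]

df['description'].apply(principal_period)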
Output:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
Name: description, dtype: object
Since this uses apply on strings, it might be slow.
Upvotes: 0
Reputation: 9018
The code below works for me:
import pandas as pd

a = pd.Series(["hello world hello world",
               "I will keep the 2000 followers same I will keep the 2000 followers same",
               "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
# Keep a word only where its position equals the index of its first occurrence.
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))
The idea is to keep a word only if its position is that of its first occurrence in the list obtained by splitting the string on whitespace. If a word occurs a second (or later) time, .index() returns an index smaller than the current position, so that occurrence is dropped.
This will give you:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
dtype: object
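The filtering rule described in this answer can be traced on a single string without pandas. This is only an illustrative sketch; the variable names are not from the answer:

```python
text = "hello world hello world"
words = text.split()

# For each word: (position, word, index of its first occurrence in the list).
trace = [(i, w, words.index(w)) for i, w in enumerate(words)]
print(trace)

# Keep a word only when its position is the position of its first occurrence.
kept = [w for i, w, first in trace if first == i]
result = " ".join(kept)
print(result)
```

Positions 2 and 3 map back to first-occurrence indexes 0 and 1, so those two words are dropped. Because `.index()` rescans the list for every word, the cost grows quadratically with the number of words in a row, which should be fine for short descriptions.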
Upvotes: 0