Bilgin

Reputation: 519

How to remove duplicated words in CSV rows in Python?

I am working with a CSV file in which many rows contain duplicated words. I want to remove the duplicates without changing the order of the remaining words.

CSV file example (userID and description are the column names):

userID, description

12, hello world hello world

13, I will keep the 2000 followers same I will keep the 2000 followers same

14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car

.

.

I would like to have the output as:

userID, description

12, hello world 

13, I will keep the 2000 followers same

14, I paid $2000 to the car 

.

.

I already tried posts such as 1, 2 and 3, but none of them fixed my problem; the output was unchanged. The order of the words in my output file matters. It would be great if you could provide a code sample that I can run on my side and learn from. Thank you.

[I am using Python 3.7]

Upvotes: 2

Views: 894

Answers (4)

Display name

Reputation: 926

To remove duplicates, I'd suggest a solution involving the OrderedDict data structure:

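from collections import OrderedDict

# Split each row into words, keep the first occurrence of each word in order, then rejoin
df['Desired'] = (df['Current'].str.split()
                          .apply(lambda x: OrderedDict.fromkeys(x).keys())
                          .str.join(' '))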
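
As a rough end-to-end sketch of the same first-occurrence idea, the snippet below applies it to the sample rows from the question using the standard-library `csv` module instead of pandas. The helper name `dedupe_words` and the in-memory `SAMPLE` text are assumptions made for illustration; a real script would open the input and output files instead.

```python
import csv
import io
from collections import OrderedDict

# Sample data copied from the question; a real script would open the file instead.
SAMPLE = """userID, description
12, hello world hello world
13, I will keep the 2000 followers same I will keep the 2000 followers same
14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car
"""


def dedupe_words(text):
    """Keep the first occurrence of each whitespace-separated word, in order."""
    return " ".join(OrderedDict.fromkeys(text.split()))


# skipinitialspace=True drops the blank that follows each comma in the sample.
reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
rows = [
    {"userID": row["userID"], "description": dedupe_words(row["description"])}
    for row in reader
]

# Write the cleaned rows back out in CSV form (to a string here, for the demo).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["userID", "description"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Note that this drops every repeated word, not just repeated phrases, so a sentence that legitimately uses a word twice (for example "the cat and the dog") would lose the second "the".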

Upvotes: 2

Ricky Kim

Reputation: 2022

Answer taken from "How can I tell if a string repeats itself in Python?"

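import pandas as pd

def principal_period(s):
    # Append a space so that every repetition ends with the same delimiter
    s = s.strip() + ' '
    # Look for s inside s+s, skipping the trivial matches at index 0 and len(s)
    i = (s + s).find(s, 1, -1)
    # Nothing repeats: keep the text as it is; otherwise return the repeating unit
    return s.strip() if i == -1 else s[:i].strip()

df = pd.read_csv(r'path\to\filename_in.csv')
# apply() returns a new Series, so assign it back to the column
df['description'] = df['description'].apply(principal_period)
df.to_csv(r'output\path\filename_out.csv')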

Explanation:

I append a space to the string because the repeated phrases are separated by spaces, so every repetition then ends with the same delimiter. The string is then searched for inside itself concatenated with itself. The search starts at index 1 to skip the trivial match at the very beginning, and it stops one character before the end to skip the trivial match at the midpoint, which is the only other match when nothing repeats. The index found is where the second occurrence starts, which is also where the shortest repeating unit ends, so slicing up to that index gives the repeating unit.
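
To make the search step concrete, here is a small standalone sketch that traces it on one of the question's rows. It is a variation on the function above, not the original answer's exact code: as an assumption about the desired behavior, it returns the text unchanged when nothing repeats, and it strips the helper space before returning.

```python
def principal_period(s):
    """Return the shortest space-delimited unit that s repeats, or s if none."""
    padded = s.strip() + " "
    # Look for padded inside padded+padded, skipping the trivial match at 0
    # and (via the -1 end bound) the trivial match at len(padded).
    i = (padded + padded).find(padded, 1, -1)
    return s.strip() if i == -1 else padded[:i].strip()


doubled = "hello world hello world "
# "hello world " is 12 characters long, so the shifted copy first matches at 12.
print((doubled + doubled).find(doubled, 1, -1))
print(principal_period("hello world hello world"))
print(principal_period("no repetition here"))
```

This approach only collapses a row that consists entirely of one repeated phrase; it leaves a row alone if the repetition is partial.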

Upvotes: 0

Quang Hoang

Reputation: 150785

Solution taken from here:

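def principal_period(s):
    # Add a trailing space so that every repetition ends with the same delimiter
    s = s.strip() + ' '
    # The first match of s inside s+s after index 0 gives the length of the repeating unit
    i = (s + s).find(s, 1)
    return s[:i].strip()

df['description'] = df['description'].apply(principal_period)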

Output:

0                            hello world
1    I will keep the 2000 followers same
2                I paid $2000 to the car
Name: description, dtype: object

Because apply calls the Python function once per row, this might be slow on large files.

Upvotes: 0

Yilun Zhang

Reputation: 9018

The code below works for me:

import pandas as pd

a = pd.Series(["hello world hello world",
               "I will keep the 2000 followers same I will keep the 2000 followers same",
               "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
# Keep a word only when its first occurrence in the row is at the current position
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split()) if x.split().index(w) == i]))

The idea is to keep a word only if its position is that of its first occurrence in the list of words (obtained by splitting the string on whitespace). When a word occurs a second or later time, .index() returns the index of its first occurrence, which is smaller than the current position, so that occurrence is dropped.

This will give you:

0                            hello world
1    I will keep the 2000 followers same
2                I paid $2000 to the car
dtype: object
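
The first-occurrence rule described above can also be sketched without calling `.index()` for every word, by remembering the words seen so far in a set. The function names `keep_first_by_index` and `keep_first_occurrences` are made up for this illustration; the first mirrors the lambda above and the second is the set-based variant, which avoids rescanning the list for each word.

```python
def keep_first_by_index(text):
    """Index-based version: keep a word only at its first position."""
    words = text.split()
    return " ".join(w for i, w in enumerate(words) if words.index(w) == i)


def keep_first_occurrences(text):
    """Set-based version: drop every word that already appeared earlier."""
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


samples = [
    "hello world hello world",
    "I will keep the 2000 followers same I will keep the 2000 followers same",
    "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car",
]
for text in samples:
    print(keep_first_occurrences(text))
```

Either function can be passed to `Series.apply` in place of the lambda.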

Upvotes: 0
