Reputation: 465
I have a CSV file with many repeated words in the same cell, caused by concatenation (that part is out of my control). This usually occurs in the same column. Here is an example of what I have:
Name,Geo Location,Default
DRE EXT EXT Pair Video,,
DRE United Kingdom EXT LON Extrane lo.EXT LON RD01,United Kingdom,
DRE United Kingdom EXT LON Extrane lo.EXT LON RD02,United Kingdom,
DRE United Kingdom LON lab dyna test LON,United Kingdom,
DRE United StatesCPT Corp Point Link_Pair Video DRE,United States,
DRE United Kingdom SDD SASD-D TRAIL01 to RD01,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL01 to RD02 SASD-D,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL02 to RD01,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL02 to RD02,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL01 to TRAIL02,United Kingdom,
DRE United Kingdom SDD SASD-D RD01 to RD02,United Kingdom,
DRE United States MDR SASD-D XC SASD-D Xplay to SASD-D,United States,
DRE Hong Kong (China) Hongkong HKOuter RD01 HKInter,"Hong Kong, Hong Kong",
DRE United Kingdom DRE LON Sq lab dynam test,United Kingdom,
DRE United States USTHA SPS Thalberg usthamd mdf01,United States,
DRE Hong Kong (China)DRE SASD-D Hong Kong Citi SASD-D EXT,Hong Kong,
SASD-D United States SASD-D USPHXCAP VRF SASD-D USPHXCAP RD02,United States,
I need to remove the duplicate words, but only in the same cell.
I have started with the code below, which is based on a lot of other questions and answers here about similar subjects. My code is not working, and I don't know how to fix it or whether there is a better method for this.
from csv import DictReader, DictWriter

with open('file1.csv') as fi1,\
     open('file2.csv', 'wb') as fout1:
    read1 = DictReader(fi1)
    write1 = DictWriter(fout1, fieldnames=read1.fieldnames)
    write1.writeheader()
    for line1 in read1:
        col = line1['Name']
        outline = dict(line1)
        ' '.join(set(col.split()))
        write1.writerow(outline)
I need help making this work, or another method that does. I was thinking that if there were a method to clear the set between lines, it might work.
Thanks, B0T
Upvotes: 0
Views: 726
Reputation: 142216
It kind of looks like you want to remove duplicate tokens that consist of capital letters and may also contain numbers or hyphens, so maybe something like:
s = 'SASD-D United States SASD-D USPHXCAP VRF SASD-D USPHXCAP RD02'
import re
out = re.sub(r'(\b[-A-Z0-9]{2,}\b)', lambda L, seen=set(): '' if L.group(1) in seen or seen.add(L.group(1)) else L.group(1), s)
print ' '.join(out.split())  # collapse the gaps left where tokens were removed
# SASD-D United States USPHXCAP VRF RD02
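If it helps, here is the same seen-set idea unrolled into a named function (Python 3 here; the function name is just for illustration). Note that a bare `re.sub` leaves double spaces where tokens were removed, so this sketch collapses them with a split/join at the end:

```python
import re

def dedupe_upper_tokens(s):
    """Drop repeated capital/number/hyphen tokens, keeping the first occurrence."""
    seen = set()

    def repl(m):
        tok = m.group(1)
        if tok in seen:
            return ''          # already emitted once; drop this copy
        seen.add(tok)
        return tok             # first time we see it; keep it

    out = re.sub(r'(\b[-A-Z0-9]{2,}\b)', repl, s)
    return ' '.join(out.split())  # collapse spaces left by removed tokens

print(dedupe_upper_tokens(
    'SASD-D United States SASD-D USPHXCAP VRF SASD-D USPHXCAP RD02'))
# SASD-D United States USPHXCAP VRF RD02
```

A fresh `seen` set is created on every call, so each cell is de-duplicated independently.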
Upvotes: 0
Reputation: 365945
The problem with this:
' '.join(set(col.split()))
… is that you're calculating the result, but not doing anything with it. If you want to replace the Name value in the row you write out, you have to do this:
outline['Name'] = ' '.join(set(col.split()))
Meanwhile, set
returns values in arbitrary order. So, once you fix that, you're going to end up randomly scrambling the words. Worse, it may appear to work as expected with small sets on your system, but then fail with larger sets or on another machine…
You can use the OrderedSet
recipe linked from the collections
docs.
However, there's another alternative that seems cleaner: the unique_everseen
function from the itertools
recipes.
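For reference, the recipe looks roughly like this (Python 3 spelling; in Python 2 the import is itertools.ifilterfalse):

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    """Yield unique elements, preserving order; remember all elements ever seen."""
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

print(' '.join(unique_everseen('DRE EXT EXT Pair Video'.split())))
# DRE EXT Pair Video
```

Unlike a plain `set`, this keeps the words in their original order while still dropping repeats.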
While we're at it, I don't understand why you're doing outline = dict(line1)
, when line1
is already a dict, and you have no need for two separate copies (normally, you'd want that so you can modify one without affecting other other, but that's not an issue here).
So:
for line1 in read1:
    line1['Name'] = ' '.join(unique_everseen(line1['Name'].split()))
    write1.writerow(line1)
Finally:
I was thinking that if there were a method to clear the set between lines, it might work.
You're already creating a new set for each line, so there's nothing to clear.
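Putting the pieces together, a complete sketch might look like the following (Python 3 here, so the output file is opened in text mode with newline='' rather than 'wb'; the inline seen-set trick stands in for unique_everseen, and the file names are the ones from the question):

```python
from csv import DictReader, DictWriter

def dedupe_name(name):
    """Remove repeated words from one cell, keeping first-seen order."""
    seen = set()
    # seen.add() returns None, so the `or` clause adds the word as a side effect
    return ' '.join(w for w in name.split() if not (w in seen or seen.add(w)))

def clean_csv(fin, fout):
    reader = DictReader(fin)
    writer = DictWriter(fout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row['Name'] = dedupe_name(row['Name'])
        writer.writerow(row)

if __name__ == '__main__':
    with open('file1.csv', newline='') as fin, \
         open('file2.csv', 'w', newline='') as fout:
        clean_csv(fin, fout)
```

Because `dedupe_name` builds its `seen` set fresh for every row, there is indeed nothing to clear between lines.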
Upvotes: 1