Reputation: 465
I have a CSV file with many repeated words in the same cell, caused by concatenation (that part is out of my control). This usually occurs in the same column. Here is an example of what I have:
Name,Geo Location,Default
DRE EXT EXT Pair Video,,
DRE United Kingdom EXT LON Extrane lo.EXT LON RD01,United Kingdom,
DRE United Kingdom EXT LON Extrane lo.EXT LON RD02,United Kingdom,
DRE United Kingdom LON lab dyna test LON,United Kingdom,
DRE United StatesCPT Corp Point Link_Pair Video DRE,United States,
DRE United Kingdom SDD SASD-D TRAIL01 to RD01,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL01 to RD02 SASD-D,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL02 to RD01,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL02 to RD02,United Kingdom,
DRE United Kingdom SDD SASD-D TRAIL01 to TRAIL02,United Kingdom,
DRE United Kingdom SDD SASD-D RD01 to RD02,United Kingdom,
DRE United States MDR SASD-D XC SASD-D Xplay to SASD-D,United States,
DRE Hong Kong (China) Hongkong HKOuter RD01 HKInter,"Hong Kong, Hong Kong",
DRE United Kingdom DRE LON Sq lab dynam test,United Kingdom,
DRE United States USTHA SPS Thalberg usthamd mdf01,United States,
DRE Hong Kong (China)DRE SASD-D Hong Kong Citi SASD-D EXT,Hong Kong,
SASD-D United States SASD-D USPHXCAP VRF SASD-D USPHXCAP RD02,United States,
I need to remove the duplicate words, but only in the same cell.
I have started with the code below, which is based on a lot of other questions and answers here about similar subjects. My code is not working, and I don't know how to fix it or whether there is a better method for this.
from csv import DictReader, DictWriter

with open('file1.csv') as fi1,\
     open('file2.csv', 'wb') as fout1:
    read1 = DictReader(fi1)
    write1 = DictWriter(fout1, fieldnames=read1.fieldnames)
    write1.writeheader()
    for line1 in read1:
        col = line1['Name']
        outline = dict(line1)
        ' '.join(set(col.split()))
        write1.writerow(outline)
I need help making this work, or another method that does. I was thinking that if there were a method to clear the set between lines, it might work.
Thanks, B0T
Upvotes: 0
Views: 726
Reputation: 142216
It kind of looks like you want to remove duplicate tokens that consist of capital letters and may also contain numbers or hyphens, so maybe something like:
s = 'SASD-D United States SASD-D USPHXCAP VRF SASD-D USPHXCAP RD02'
import re
out = re.sub(r'(\b[-A-Z0-9]{2,}\b)', lambda L, seen=set(): '' if L.group(1) in seen or seen.add(L.group(1)) else L.group(1), s)
print ' '.join(out.split())  # collapse the gaps left where tokens were removed
# SASD-D United States USPHXCAP VRF RD02
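If it helps, here is the same seen-set idea unrolled into a named function (Python 3 here; the function name is just for illustration). Note that a bare `re.sub` leaves double spaces where tokens were removed, so this sketch collapses them with a split/join at the end:

```python
import re

def dedupe_upper_tokens(s):
    """Drop repeated capital/number/hyphen tokens, keeping the first occurrence."""
    seen = set()

    def repl(m):
        tok = m.group(1)
        if tok in seen:
            return ''          # already emitted once; drop this copy
        seen.add(tok)
        return tok             # first time we see it; keep it

    out = re.sub(r'(\b[-A-Z0-9]{2,}\b)', repl, s)
    return ' '.join(out.split())  # collapse spaces left by removed tokens

print(dedupe_upper_tokens(
    'SASD-D United States SASD-D USPHXCAP VRF SASD-D USPHXCAP RD02'))
# SASD-D United States USPHXCAP VRF RD02
```

A fresh `seen` set is created on every call, so each cell is de-duplicated independently.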
Upvotes: 0
Reputation: 365945
The problem with this:
' '.join(set(col.split()))
… is that you're calculating the result, but not doing anything with it. If you want to replace the Name value in the row you write out, you have to do this:
outline['Name'] = ' '.join(set(col.split()))
Meanwhile, set
returns values in arbitrary order. So, once you fix that, you're going to end up randomly scrambling the words. Worse, it may appear to work as expected with small sets on your system, but then fail with larger sets or on another machine…
You can use the OrderedSet
recipe linked from the collections
docs.
However, there's another alternative that seems cleaner: the unique_everseen
function from the itertools
recipes.
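For reference, the recipe looks roughly like this (Python 3 spelling; in Python 2 the import is itertools.ifilterfalse):

```python
from itertools import filterfalse

def unique_everseen(iterable, key=None):
    """Yield unique elements, preserving order; remember all elements ever seen."""
    seen = set()
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen.add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen.add(k)
                yield element

print(' '.join(unique_everseen('DRE EXT EXT Pair Video'.split())))
# DRE EXT Pair Video
```

Unlike a plain `set`, this keeps the words in their original order while still dropping repeats.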
While we're at it, I don't understand why you're doing outline = dict(line1)
, when line1
is already a dict, and you have no need for two separate copies (normally, you'd want that so you can modify one without affecting other other, but that's not an issue here).
So:
for line1 in read1:
    line1['Name'] = ' '.join(unique_everseen(line1['Name'].split()))
    write1.writerow(line1)
Finally:
I was thinking that if there were a method to clear the set between lines, it might work.
You're already creating a new set for each line, so there's nothing to clear.
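Putting the pieces together, a complete sketch might look like the following (Python 3 here, so the output file is opened in text mode with newline='' rather than 'wb'; the inline seen-set trick stands in for unique_everseen, and the file names are the ones from the question):

```python
from csv import DictReader, DictWriter

def dedupe_name(name):
    """Remove repeated words from one cell, keeping first-seen order."""
    seen = set()
    # seen.add() returns None, so the `or` clause adds the word as a side effect
    return ' '.join(w for w in name.split() if not (w in seen or seen.add(w)))

def clean_csv(fin, fout):
    reader = DictReader(fin)
    writer = DictWriter(fout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row['Name'] = dedupe_name(row['Name'])
        writer.writerow(row)

if __name__ == '__main__':
    with open('file1.csv', newline='') as fin, \
         open('file2.csv', 'w', newline='') as fout:
        clean_csv(fin, fout)
```

Because `dedupe_name` builds its `seen` set fresh for every row, there is indeed nothing to clear between lines.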
Upvotes: 1