cornerstone

Reputation: 659

Python - remove duplicate lines with specific key

I have around 500k lines which look like the following in a text file (sample snippet) -

1,Party-120273.gif,16256,23ss423
2,Party-120275.gif,16456,23423
3,Party-120273.gif,12656,232423
4,Party-120273.gif,165236,2312423
5,Party-120276.gif,165236,2312423

How do I remove the duplicate occurrences of lines in the file based on the value in the 2nd column? For example, in the above lines, remove the duplicate occurrences of lines which contain Party-120273.gif. Only the first occurrence should be left undeleted. Hence the output should be -

1,Party-120273.gif,16256,23ss423
2,Party-120275.gif,16456,23423
5,Party-120276.gif,165236,2312423

I have to do this for the entire file, removing every line whose 2nd-column value has already appeared earlier. How would I do this in Python?

Upvotes: 0

Views: 1551

Answers (1)

Aaron Digulla

Reputation: 328624

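Does it have to be Python? Why not use sort(1)? Note that the output comes back ordered by the second field rather than in the original line order: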

sort --field-separator=, --key=2,2 --unique < file

If you still want to do it in Python, look at the csv module to parse the lines:

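import csv

seenKeys = set()
with open('file', newline='') as f:   # same input file as in the sort example
    for row in csv.reader(f):
        if row[1] in seenKeys:        # key already seen: skip the duplicate
            continue

        seenKeys.add(row[1])
        print(','.join(row))          # write the row back out comma-separated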
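
For reference, here is a minimal, self-contained sketch of the same seen-keys idea wrapped in a function and run against the sample lines from the question. The name `dedupe_by_column` and its `column` parameter are made up for this illustration and are not part of the csv module; the sketch also assumes the set of distinct keys fits in memory, which should be the case for around 500k lines.

```python
import csv
import io


def dedupe_by_column(lines, column=1):
    """Yield parsed rows whose value in `column` has not been seen before.

    `dedupe_by_column` is a hypothetical helper for this example only.
    """
    seen = set()
    for row in csv.reader(lines):
        # Skip blank or short lines instead of raising IndexError.
        if len(row) <= column:
            continue
        key = row[column]
        if key in seen:
            continue  # a row with this key was already emitted
        seen.add(key)
        yield row


# Sample data copied from the question.
sample = io.StringIO(
    "1,Party-120273.gif,16256,23ss423\n"
    "2,Party-120275.gif,16456,23423\n"
    "3,Party-120273.gif,12656,232423\n"
    "4,Party-120273.gif,165236,2312423\n"
    "5,Party-120276.gif,165236,2312423\n"
)

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(dedupe_by_column(sample))
result = out.getvalue()
print(result, end="")
```

Unlike the sort approach, this keeps the surviving lines in their original order, because each row is emitted the first time its key is encountered. For a real file, pass an open file object (opened with `newline=''`) instead of the `io.StringIO` sample.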

Upvotes: 4
