Python - remove duplicate lines with specific key

Question

I have around 500k lines which look like the following in a text file (sample snippet) -

1,Party-120273.gif,16256,23ss423
2,Party-120275.gif,16456,23423
3,Party-120273.gif,12656,232423
4,Party-120273.gif,165236,2312423
5,Party-120276.gif,165236,2312423

How do I remove the duplicate occurrences of lines in the file based on the value in the 2nd column? For example, in the lines above, remove the duplicate occurrences of lines which contain Party-120273.gif. Only the first occurrence should be kept. Hence the output should be -

1,Party-120273.gif,16256,23ss423
2,Party-120275.gif,16456,23423
5,Party-120276.gif,165236,2312423

I have to do this for the entire file, removing every line whose 2nd-column value has already appeared. How would I do this in Python?

Aaron Digulla · Accepted Answer

Does it have to be Python? Why not use sort(1):

sort --field-separator=, --key=2,2 --unique < file

Note that sort orders the output by the second column rather than keeping the original line order. If you still want to do it in Python, look at the csv module to parse the lines:

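import csv

seenKeys = set()
with open('file', newline='') as f:
    for row in csv.reader(f):
        # Skip any row whose 2nd column has already been seen
        if row[1] in seenKeys:
            continue

        seenKeys.add(row[1])
        print(','.join(row))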
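
For a file of around 500k lines it may be handier to write the result to a new file instead of printing it. The following is only a minimal sketch of that idea, assuming the key is always the second comma-separated field. The names dedupe_by_column and dedupe_file, and the paths input.txt and output.txt in the usage note, are hypothetical and not part of the answer above. Skipping lines that are too short to contain the key column is also an assumption.

```python
import csv


def dedupe_by_column(rows, key_index=1):
    """Yield only the first row seen for each value in column key_index."""
    seen = set()
    for row in rows:
        if len(row) <= key_index:
            continue  # assumption: skip blank or short lines lacking the key column
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            yield row


def dedupe_file(src_path, dst_path, key_index=1):
    """Stream src_path to dst_path, keeping the first line for each key."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerows(dedupe_by_column(csv.reader(src), key_index))
```

Calling dedupe_file("input.txt", "output.txt") would keep the lines in their original order. Only the set of distinct keys is held in memory, not the whole file.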
