Reputation: 659
I have around 500k lines which look like the following in a text file (sample snippet) -
1,Party-120273.gif,16256,23ss423
2,Party-120275.gif,16456,23423
3,Party-120273.gif,12656,232423
4,Party-120273.gif,165236,2312423
5,Party-120276.gif,165236,2312423
How do I remove the duplicate occurrences of lines in the file based on the value in the 2nd column? For example, in the lines above, remove the duplicate occurrences of lines which contain Party-120273.gif; only the first occurrence should be kept. Hence the output should be -
1,Party-120273.gif,16256,23ss423
2,Party-120275.gif,16456,23423
5,Party-120276.gif,165236,2312423
I have to do this for the entire file, removing the duplicate lines with repeating values in the 2nd column. How would I do this in Python?
Upvotes: 0
Views: 1551
Reputation: 328624
Does it have to be Python? Why not use sort(1):
sort --field-separator=, --key=2,2 --unique < file
If you still want to do it in Python, look at the csv module to parse the lines:
import csv

seenKeys = set()
with open('file.txt', newline='') as f:  # 'file.txt' is a placeholder for your input file
    for row in csv.reader(f):
        if row[1] in seenKeys:
            continue
        seenKeys.add(row[1])
        print(','.join(row))
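If the deduplicated lines should go to a new file rather than standard output, the same idea works with csv.writer. A minimal sketch, assuming hypothetical file names input.txt and output.txt:

import csv

# Hypothetical file names; substitute the real input and output paths.
with open('input.txt', newline='') as fin, open('output.txt', 'w', newline='') as fout:
    writer = csv.writer(fout)
    seen = set()
    for row in csv.reader(fin):
        if row[1] not in seen:  # keep only the first line for each 2nd-column value
            seen.add(row[1])
            writer.writerow(row)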
Upvotes: 4