Remove both duplicates (original and duplicate) from text file using python

Question

I am trying to remove both copies of every duplicated line (the original and the duplicate) from a file like this:

STANGHOLMEN_TA02_GT11
STANGHOLMEN_TA02_GT41
STANGHOLMEN_TA02_GT81
STANGHOLMEN_TA02_GT11
STANGHOLMEN_TA02_GT81

Expected result:

STANGHOLMEN_TA02_GT41

I tried this script:

lines_seen = set()
with open("example.txt", "w") as output_file:
    for each_line in open("example2.txt", "r"):
        if each_line not in lines_seen:
            output_file.write(each_line)
            lines_seen.add(each_line)

Unfortunately, it doesn't work as I want: it misses some lines and doesn't remove others. The original file also has spaces every now and then between the lines.

Mushif Ali Nawaz · Accepted Answer

You need two passes for this to work correctly, because with a single pass you can't know whether the current line will be repeated later. Try something like this:

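# Pass 1: count how many times each line occurs
lines_count = {}
with open('example2.txt', "r") as input_file:
    for each_line in input_file:
        # compare lines without their line ending, so a last line
        # that lacks a trailing newline still matches its duplicate
        each_line = each_line.rstrip("\n")
        lines_count[each_line] = lines_count.get(each_line, 0) + 1

# Pass 2: write only the lines that occur exactly once
# (dicts keep insertion order in Python 3.7+, so the order is preserved)
with open('example.txt', "w") as output_file:
    for each_line, count in lines_count.items():
        if count == 1:
            output_file.write(each_line + "\n")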
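
Since the question mentions stray spaces between the lines, here is a rough variant of the same two-pass idea using collections.Counter. It is only a sketch under some assumptions: the helper name unique_lines is made up for illustration, and it assumes that surrounding whitespace should be ignored when comparing lines and that blank lines can be dropped entirely.

```python
from collections import Counter


def unique_lines(lines):
    """Return the lines that occur exactly once, in their original order."""
    # Assumption: strip surrounding whitespace and skip blank lines,
    # so "X \n" and "X" are treated as the same line.
    cleaned = [line.strip() for line in lines if line.strip()]
    # Pass 1: count every cleaned line.
    counts = Counter(cleaned)
    # Pass 2: keep only the lines that were seen exactly once.
    return [line for line in cleaned if counts[line] == 1]


sample = [
    "STANGHOLMEN_TA02_GT11\n",
    "STANGHOLMEN_TA02_GT41\n",
    "\n",
    "STANGHOLMEN_TA02_GT81 \n",
    "STANGHOLMEN_TA02_GT11\n",
    "STANGHOLMEN_TA02_GT81",
]
print(unique_lines(sample))  # ['STANGHOLMEN_TA02_GT41']
```

To use it on real files, you could pass the open input file to unique_lines and write each returned line followed by "\n" to the output file.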
