Chris

Reputation: 27

Remove both duplicates (original and duplicate) from text file using python

I am trying to remove every line that occurs more than once, including its first occurrence. For example, given:

STANGHOLMEN_TA02_GT11
STANGHOLMEN_TA02_GT41
STANGHOLMEN_TA02_GT81
STANGHOLMEN_TA02_GT11
STANGHOLMEN_TA02_GT81

Expected result:

STANGHOLMEN_TA02_GT41

I tried this script

lines_seen = set()
with open("example.txt", "w") as output_file:
    for each_line in open("example2.txt", "r"):
        if each_line not in lines_seen:
            output_file.write(each_line)
            lines_seen.add(each_line)

Unfortunately, it doesn't do what I want: it misses some lines and doesn't remove others. The original file also has blank lines every now and then between the lines.

Upvotes: 2

Views: 290

Answers (1)

Mushif Ali Nawaz

Reputation: 3866

You need two passes for this to work correctly, because with a single pass you can't know whether the current line will be repeated later. Try something like this:

# count each line's occurrences
lines_count = {}
with open('example2.txt', "r") as input_file:
    for each_line in input_file:
        lines_count[each_line] = lines_count.get(each_line, 0) + 1

# write only the lines that are not repeated
# (dicts keep insertion order in Python 3.7+, so the original order is preserved)
with open('example.txt', "w") as output_file:
    for each_line, count in lines_count.items():
        if count == 1:
            output_file.write(each_line)
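
Since the question mentions spaces between the lines, exact string comparison may treat lines that look the same as different (for example a trailing space, or a last line with no newline). Below is a minimal sketch of the same two-pass idea using collections.Counter. It assumes that surrounding whitespace should be ignored and blank lines dropped, which is my reading of the question rather than something it states. The names unique_lines and remove_duplicated_lines are made up for illustration.

```python
from collections import Counter


def unique_lines(lines):
    """Return the lines that occur exactly once, in first-seen order.

    Assumption: surrounding whitespace is ignored when comparing,
    and blank lines are dropped.
    """
    # Normalise each line and skip the blank ones
    cleaned = [line.strip() for line in lines if line.strip()]
    # First pass: count how often each normalised line occurs
    counts = Counter(cleaned)
    # Second pass: keep only the lines seen exactly once
    return [line for line in cleaned if counts[line] == 1]


def remove_duplicated_lines(src_path, dst_path):
    """Read src_path and write its non-repeated lines to dst_path."""
    with open(src_path, "r") as src:
        kept = unique_lines(src)
    with open(dst_path, "w") as dst:
        dst.writelines(line + "\n" for line in kept)


# Sample data from the question, with a trailing space, a blank line,
# and a final line without a newline added to show the normalisation
sample = [
    "STANGHOLMEN_TA02_GT11\n",
    "STANGHOLMEN_TA02_GT41 \n",
    "\n",
    "STANGHOLMEN_TA02_GT81\n",
    "STANGHOLMEN_TA02_GT11\n",
    "STANGHOLMEN_TA02_GT81",
]
print(unique_lines(sample))  # ['STANGHOLMEN_TA02_GT41']
```

With the file names from the question you would call remove_duplicated_lines("example2.txt", "example.txt"). If whitespace differences should count as real differences, drop the strip() calls and compare the raw lines instead.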

Upvotes: 2
