Reputation: 47

Not all duplicates are deleted from a text file in Python

I am new to Python. I am trying to delete duplicates from my text file by doing the following:

line_seen = set()

f = open('a.txt', 'r')
w = open('out.txt', 'w')

for i in f:
    if i not in line_seen:
        w.write(i)
        line_seen.add(i)

f.close()
w.close()

In the initial file I had

hello
world
python
world
hello

And in output file I got

hello
world
python
hello

So it did not remove the last duplicate. Can anyone help me understand why this happened and how I can fix it?

Upvotes: 2

Answers (4)

gnight

Reputation: 472

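Most likely the last line of your file does not end with a newline. The line already seen is 'hello\n', but the last line is just 'hello', so the two strings are not equal.

Either fix the input file or call strip() on each line (i) before comparing it.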
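
To see the mismatch directly, you can print the repr() of each line as it is read. This is only a minimal sketch: it assumes the file content from the question with no newline after the final line, and io.StringIO stands in for a.txt so the snippet runs on its own.

```python
import io

# Stand-in for a.txt (assumption: the last line has no trailing newline)
f = io.StringIO("hello\nworld\npython\nworld\nhello")

lines = list(f)
for line in lines:
    # repr() makes the trailing "\n" visible
    print(repr(line))

# The first and last lines differ only by the trailing newline
print(lines[0] == lines[-1])                  # False
print(lines[0].strip() == lines[-1].strip())  # True
```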

Upvotes: 0

Leshawn Rice

Reputation: 548

# Since we check if the line exists in lines, we can use a list instead of
# a set to preserve order
lines = []

infile = open('a.txt', 'r')
outfile = open('out.txt', 'w')

# Use the readlines method
for line in infile.readlines():
    # Strip whitespace before the membership check, so that
    # 'hello\n' and 'hello' compare equal
    line = line.strip()
    if line not in lines:
        lines.append(line)

for line in lines:
    # Add the newline back
    outfile.write("{}\n".format(line))

infile.close()
outfile.close()

Upvotes: -1

Patrick Artner

Reputation: 51683

The first line probably contains 'hello\n' - the last line contains only 'hello' - they are not the same.

Use

line_seen = set()

with open('a.txt', 'r') as f, open('out.txt', 'w') as w:
    for i in f:
        i = i.strip()            # remove the \n from line
        if i not in line_seen:
            w.write(i + "\n")
            line_seen.add(i)
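
Note that strip() removes all leading and trailing whitespace, so lines that differ only in indentation would be treated as duplicates. If that matters for your data, one possible variant removes only the line ending with rstrip("\n"). The helper name unique_lines and the sample list below are made up for illustration:

```python
def unique_lines(lines):
    """Yield each distinct line once, in first-seen order, without its newline."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")  # drop only the line ending, keep other whitespace
        if line not in seen:
            seen.add(line)
            yield line


# Hypothetical sample: "  hello" is kept because its indentation differs
sample = ["hello\n", "world\n", "  hello\n", "world\n", "hello"]
result = list(unique_lines(sample))
print(result)  # ['hello', 'world', '  hello']
```

With a real file you could pass the open file object directly, e.g. unique_lines(f), and write each yielded line followed by "\n".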

Upvotes: 3

TheFaultInOurStars

Reputation: 3608

The main problem is the newline character ("\n"), which appears at the end of every line except the last. You can combine set, map and join as follows (note that a set does not preserve the original line order):

# The with statement closes both files, so the output is flushed to disk
with open('a.txt', 'r') as f, open('out.txt', 'w') as w:
    w.write("\n".join(set(map(str.strip, f.readlines()))))

out.txt (the order may vary, because sets are unordered):

python
world
hello
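
As the output above shows, the set-based version loses the original order. If order matters, one option is dict.fromkeys, since dictionaries keep insertion order in Python 3.7+. A minimal sketch, with io.StringIO standing in for a.txt (an assumption made so the snippet is self-contained):

```python
import io

# Stand-in for a.txt with the content from the question
f = io.StringIO("hello\nworld\npython\nworld\nhello")

# dict keys are unique and keep first-seen order
unique = list(dict.fromkeys(line.strip() for line in f))

output = "\n".join(unique) + "\n"
print(output, end="")
```

Here output is the text you would pass to w.write().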

If you want to stick to your previous approach you can use:

line_seen = set()

f = open('a.txt', 'r')
w = open('out.txt', 'w')

for i in f:
    i = i.strip()
    if i not in line_seen:
        w.write(i + "\n")  # strip() removed the newline, so add it back
        line_seen.add(i)

f.close()
w.close()

Upvotes: 1
