Reputation: 47
I am new to Python. I am trying to delete duplicates from my text file by doing the following:
line_seen = set()
f = open('a.txt', 'r')
w = open('out.txt', 'w')
for i in f:
    if i not in line_seen:
        w.write(i)
        line_seen.add(i)
f.close()
w.close()
In the initial file I had:
hello
world
python
world
hello
And in the output file I got:
hello
world
python
hello
So it did not remove the last duplicate. Can anyone help me understand why this happened and how I can fix it?
Upvotes: 2
Views: 54
Reputation: 472
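Most likely the last line of your file does not end with a newline. The line already stored in the set is 'hello\n', while the last line is just 'hello', so the two do not compare equal.
Either fix the input file or call strip() on each line i after reading it.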
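Here is a small sketch of why the comparison fails. It uses io.StringIO as a stand-in for a.txt (my assumption, so the example runs without a file on disk), and the final line deliberately has no trailing newline:

```python
import io

# Stand-in for a.txt; note the last line has no trailing "\n"
fake_file = io.StringIO("hello\nworld\npython\nworld\nhello")

# Iterating a file-like object yields lines with their newline still attached
raw_lines = list(fake_file)

print(repr(raw_lines[0]))   # first line, newline included
print(repr(raw_lines[-1]))  # last line, no newline

# The raw strings differ, but they match once the whitespace is stripped
print(raw_lines[0] == raw_lines[-1])
print(raw_lines[0].strip() == raw_lines[-1].strip())
```

Printing with repr() makes the hidden "\n" visible, which is usually the quickest way to spot this kind of mismatch.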
Upvotes: 0
Reputation: 548
# Since we check if the line exists in lines, we can use a list instead of
# a set to preserve order
lines = []
infile = open('a.txt', 'r')
outfile = open('out.txt', 'w')
# Use the readlines method
for line in infile.readlines():
    # Strip whitespace before comparing, so 'hello\n' and 'hello' match
    line = line.strip()
    if line not in lines:
        lines.append(line)
for line in lines:
    # Add the newline back
    outfile.write("{}\n".format(line))
infile.close()
outfile.close()
Upvotes: -1
Reputation: 51683
The first line probably contains 'hello\n', while the last line contains only 'hello', so they are not the same.
Use:
line_seen = set()
with open('a.txt', 'r') as f, open('out.txt', 'w') as w:
    for i in f:
        i = i.strip()  # remove the \n from line
        if i not in line_seen:
            w.write(i + "\n")
            line_seen.add(i)
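If leading or trailing spaces in a line are significant, a variant that removes only the line terminator may be closer to what you want. This is a sketch rather than a drop-in replacement; the helper name dedupe_lines is my own, made up for illustration:

```python
def dedupe_lines(lines):
    """Yield each distinct line once, in first-seen order (hypothetical helper)."""
    seen = set()
    for line in lines:
        line = line.rstrip("\n")  # drop only the newline, keep other whitespace
        if line not in seen:
            seen.add(line)
            yield line


# Works on any iterable of lines, for example an open file object
sample = ["hello\n", "world\n", "python\n", "world\n", "hello"]
result = list(dedupe_lines(sample))
print(result)
```

With real files, something like w.writelines(line + "\n" for line in dedupe_lines(f)) inside the same with block should do the job.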
Upvotes: 3
Reputation: 3608
The main problem is the newline character ("\n"), which appears at the end of every line except the last. You can use a combination of set, map, and join, as follows:
f = open('a.txt', 'r')
w = open('out.txt', 'w')
w.write("\n".join(set(map(str.strip, f.readlines()))))
f.close()
w.close()
Output (the order is arbitrary, because a set is unordered):
python
world
hello
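Since a set does not keep the original order, a possible alternative is dict.fromkeys, which keeps first-seen order because dicts preserve insertion order in Python 3.7+. A minimal sketch, using an in-memory list in place of f.readlines() (an assumption so that it runs on its own):

```python
# Stand-in for f.readlines(); the last line has no trailing newline
raw = ["hello\n", "world\n", "python\n", "world\n", "hello"]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7)
unique = list(dict.fromkeys(map(str.strip, raw)))

# Join with newlines, ready to be passed to w.write()
text = "\n".join(unique)
print(text)
```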
If you want to stick to your previous approach you can use:
line_seen = set()
f = open('a.txt', 'r')
w = open('out.txt', 'w')
for i in f:
    i = i.strip()  # remove the trailing newline before comparing
    if i not in line_seen:
        w.write(i + "\n")  # add the newline back when writing
        line_seen.add(i)
f.close()
w.close()
Upvotes: 1