I am trying to create a simple program that removes duplicate lines from a file, but I am stuck. My goal is to remove all but one copy of each duplicated line, so that I still keep that data; this is what makes my question different from the suggested duplicate. I would also like the program to read from and write to the same filename. When I set both filenames to the same value, it just outputs an empty file.
input_file = "input.txt"
output_file = "input.txt"
seen_lines = set()
outfile = open(output_file, "w")
for line in open(input_file, "r"):
    if line not in seen_lines:
        outfile.write(line)
        seen_lines.add(line)
outfile.close()
input.txt
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Keep the change ya filthy animal
Did someone say peanut butter?
Did someone say peanut butter?
Keep the change ya filthy animal
Expected output
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?
Upvotes: 13
Views: 1332
Reputation: 2526
Just my two cents, in case you are able to use Python 3. It uses:
- A Path object, which has a handy write_text() method.
- An OrderedDict as the data structure, to satisfy the constraints of uniqueness and order at once.
- A generator over the open file instead of Path.read_text(), to save on memory.
# in-place removal of duplicate lines, while preserving order
from collections import OrderedDict
from pathlib import Path
filepath = Path("./duplicates.txt")
with filepath.open() as _file:
    no_duplicates = OrderedDict.fromkeys(line.rstrip('\n') for line in _file)
filepath.write_text("\n".join(no_duplicates))
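On Python 3.7 and later, a plain dict also preserves insertion order, so the same idea can be sketched without OrderedDict. The helper name unique_lines and the sample list are made up for illustration; they are not part of the answer above.

```python
def unique_lines(lines):
    """Return the lines without duplicates, keeping first occurrences in order."""
    # dict keys are unique and, since Python 3.7, keep insertion order
    return list(dict.fromkeys(line.rstrip("\n") for line in lines))


sample = [
    "I really love christmas\n",
    "Keep the change ya filthy animal\n",
    "Pizza is my fav food\n",
    "Keep the change ya filthy animal\n",
]
print(unique_lines(sample))
```

Any iterable of strings works as input, including an open file object.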
Upvotes: 0
Reputation: 8205
seen_lines = []
with open('input.txt', 'r') as infile:
    lines = infile.readlines()
for line in lines:
    line_stripped = line.strip()
    if line_stripped not in seen_lines:
        seen_lines.append(line_stripped)
with open('input.txt', 'w') as outfile:
    for line in seen_lines:
        outfile.write(line)
        if line != seen_lines[-1]:
            # write '\n' rather than os.linesep: text mode already translates newlines
            outfile.write('\n')
Output:
I really love christmas
Keep the change ya filthy animal
Pizza is my fav food
Did someone say peanut butter?
Upvotes: 1
Reputation: 114230
The line outfile = open(output_file, "w") truncates your file no matter what else you do, so the reads that follow find an empty file. My recommendation for doing this safely is to use a temporary file: write the de-duplicated lines to it, close both files, then move it over the original.
This is much more robust than opening the file twice for reading and writing. If anything goes wrong, you will have the original and whatever work you did so far stashed away. Your current approach can mess up your file if anything goes wrong in the process.
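To see the truncation for yourself, here is a small sketch that runs against a throwaway file created with tempfile.mkstemp, so no real data is touched. The variable names are illustrative only.

```python
import os
import tempfile

# Create a scratch file with some content (the path is generated, not a real input file).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("line one\nline two\n")

size_before = os.path.getsize(demo_path)

# Opening with "w" empties the file immediately, before anything is written.
handle = open(demo_path, "w")
size_after_open = os.path.getsize(demo_path)
handle.close()

print(size_before, size_after_open)
os.remove(demo_path)
```

The second size is zero even though nothing was written through the new handle, which is why a read loop started afterwards sees no lines.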
Here is a sample using tempfile.NamedTemporaryFile and a with block to make sure everything is closed properly, even in case of error:
from tempfile import NamedTemporaryFile
from shutil import move
input_file = "input.txt"
output_file = "input.txt"
seen_lines = set()
with NamedTemporaryFile('w', delete=False) as output, open(input_file) as input:
    # iterate over the handle opened above instead of opening the file a second time
    for line in input:
        sline = line.rstrip('\n')
        if sline not in seen_lines:
            output.write(line)
            seen_lines.add(sline)
# both files are closed here, so the temporary file can replace the original
move(output.name, output_file)
The move at the end will work correctly even if the input and output names are the same, since output.name is guaranteed to be something different from both.
Note also that I'm stripping the newline from each line in the set, since the last line might not have one.
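As a variation on the same idea, and only as a sketch: creating the temporary file in the target's own directory keeps both files on one filesystem, so os.replace can swap them with a single rename (atomic on POSIX systems). The function name dedupe_in_place is hypothetical, and unlike the sample above it always writes a newline after every line, including the last.

```python
import os
from tempfile import NamedTemporaryFile


def dedupe_in_place(path):
    """Rewrite path without duplicate lines, keeping first occurrences in order."""
    seen = set()
    directory = os.path.dirname(os.path.abspath(path))
    # delete=False so the file survives the with block and can be renamed afterwards
    with NamedTemporaryFile("w", dir=directory, delete=False) as tmp, open(path) as src:
        for line in src:
            key = line.rstrip("\n")  # compare lines without their newline
            if key not in seen:
                seen.add(key)
                tmp.write(key + "\n")  # terminate every line the same way
    # both files are closed here; swap the temporary file in for the original
    os.replace(tmp.name, path)
```

Calling dedupe_in_place("input.txt") would rewrite that file. Note that the temporary file is created with restrictive permissions, so the resulting file mode may differ from the original, and the sketch does not clean up the temporary file if an error occurs midway.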
Alt Solution
If you don't care about the order of the lines, you can simplify the process somewhat by doing everything directly in memory:
input_file = "input.txt"
output_file = "input.txt"
with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input)
with open(output_file, 'w') as output:
    for line in unique:
        output.write(line)
        output.write('\n')
You can compare this against:
with open(input_file) as input:
    unique = set(line.rstrip('\n') for line in input.readlines())
with open(output_file, 'w') as output:
    output.write('\n'.join(unique))
The second version does the same thing, except that it reads the whole file with readlines() and writes the result in a single call, and it leaves no newline after the last line.
Upvotes: 6
Reputation: 71560
Try the code below, which uses a list comprehension with str.join, set, and sorted:
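input_file = "input.txt"
output_file = "input.txt"
# read everything first; opening the same file with "w" beforehand would truncate it
with open(input_file, "r") as infile:
    l = [i.rstrip() for i in infile.readlines()]
# set() drops duplicates; sorting by l.index restores the order of first occurrences
with open(output_file, "w") as outfile:
    outfile.write('\n'.join(sorted(set(l), key=l.index)))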
Upvotes: 0
Reputation: 13
I believe this is the easiest way to do what you want:
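with open('FileName.txt', 'r+') as i:
    AllLines = i.readlines()
    i.seek(0)  # rewind before overwriting
    # dict.fromkeys keeps the first occurrence of each line, in order (Python 3.7+)
    for line in dict.fromkeys(AllLines):
        i.write(line)  # write to file
    i.truncate()  # cut off what is left of the old, longer content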
Upvotes: 0
Reputation: 12571
The problem is that you're trying to write to the same file that you're reading from. You have at least two options:
1. Use different filenames (e.g. input.txt and output.txt). This is, at some level, easiest.
2. Read all data in from your input file, close that file, then open the file for writing.
with open('input.txt', 'r') as f:
    lines = f.readlines()
seen_lines = set()
with open('input.txt', 'w') as f:
    for line in lines:
        if line not in seen_lines:
            seen_lines.add(line)
            f.write(line)
3. Open the file for both reading and writing using r+ mode. You need to be careful in this case to read the data you're going to process before writing. If you do everything in a single loop, the loop iterator may lose track.
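The r+ option could look roughly like the sketch below, assuming the file fits in memory. The function name dedupe_with_r_plus is invented for this example. Everything is read before the first write, which sidesteps the iterator problem mentioned above.

```python
def dedupe_with_r_plus(path):
    """Remove duplicate lines from path using a single r+ file handle."""
    with open(path, "r+") as f:
        lines = f.readlines()  # read everything before writing anything
        f.seek(0)              # go back to the start of the file
        seen = set()
        for line in lines:
            if line not in seen:
                seen.add(line)
                f.write(line)
        f.truncate()           # discard the tail of the old, longer content
```

Without the final truncate() call, the leftover bytes of the original content would remain after the last written line.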
Upvotes: 4