Reputation: 1039
I have an issue which has to do with file input and output in Python (it's a continuation from this question: how to extract specific lines from a data file, which has been solved now).
I have one big file, danish.train, and eleven small files (called danish.test.part-01 and so on), each containing a different selection of the data from danish.train. For each of the eleven files, I want to create an accompanying file that complements it: a file that contains the contents of danish.train minus the part that is already in the small file.
What I've come up with so far is this:
trainFile = open("danish.train")
for file_number in range(1,12):
    input = open('danish.test.part-%02d' % file_number, 'r')
    for line in trainFile:
        if line not in input:
            with open('danish.train.part-%02d' % file_number, 'a+') as myfile:
                myfile.write(line)
The problem is that this code only produces output for file_number 1, although the loop runs from 1 to 11. If I change the range, for example to range(2,3), I get an output file danish.train.part-02, but it contains a copy of the whole danish.train without leaving out the contents of danish.test.part-02, as I wanted.
I suspect that these issues may have something to do with me not completely understanding the with... as statement, but I'm not sure. Any help would be greatly appreciated.
Upvotes: 1
Views: 61
Reputation: 25042
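When you open a file, you get back an iterator over the lines of the file. This is nice, in that it lets you go through the file one line at a time without keeping the whole file in memory at once. In your case it leads to a problem, because you need to iterate through the file multiple times: once the first pass has reached the end of danish.train, the file object is exhausted, so every later pass over trainFile yields no lines. The test line not in input has the same issue, since it consumes the test file's iterator while searching for the line.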
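To make the exhaustion visible, here is a minimal sketch that loops over the same file object twice. The temporary file and its three-line contents are made up for illustration; they stand in for danish.train.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical contents, for illustration only).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
    path = tmp.name

f = open(path)
first_pass = [line for line in f]   # reads every line and reaches end of file
second_pass = [line for line in f]  # the iterator is exhausted: nothing left
f.seek(0)                           # rewind to the start of the file
third_pass = [line for line in f]   # after rewinding, the lines are back
f.close()
os.remove(path)

print(first_pass)   # ['a\n', 'b\n', 'c\n']
print(second_pass)  # []
print(third_pass)   # ['a\n', 'b\n', 'c\n']
```

Calling seek(0) before each pass would be one way to reuse the open file, though reading the lines into memory once is usually simpler for a file that fits there.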
Instead, you can read the full training file into memory, and go through it multiple times:
with open("danish.train", 'r') as f:
    train_lines = f.readlines()

for file_number in range(1, 12):
    with open("danish.test.part-%02d" % file_number, 'r') as f:
        test_lines = set(f)
    with open("danish.train.part-%02d" % file_number, 'w') as g:
        g.writelines(line for line in train_lines if line not in test_lines)
I've simplified the logic a little bit, as well. If you don't care about the order of the lines, you could also consider reading the training lines into a set, and then just use set operations instead of the generator expression I used in the final line.
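As a rough sketch of that set-based variant: the helper name complement_lines and the sample lines are hypothetical, chosen only to show the idea on in-memory data rather than on the real files.

```python
def complement_lines(train_lines, test_lines):
    """Return the lines that occur in train_lines but not in test_lines."""
    # Set difference; the original line order is not preserved.
    return set(train_lines) - set(test_lines)


# Made-up sample data standing in for the contents of the two files.
train = ["a\n", "b\n", "c\n", "d\n"]
test = ["b\n", "d\n"]

remaining = complement_lines(train, test)
print(sorted(remaining))  # ['a\n', 'c\n']
```

Besides losing the order, a set also collapses repeated lines into one, so this only fits if duplicates in the training file do not matter. The result can be passed to writelines just like the generator expression.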
Upvotes: 1