Johanna

Reputation: 1039

File output based on the contents of another file

I have an issue which has to do with file input and output in Python (it's a continuation from this question: how to extract specific lines from a data file, which has been solved now).

So I have one big file, danish.train, and eleven small files (called danish.test.part-01 and so on), each of them containing a different selection of the data from the danish.train file. Now, for each of the eleven files, I want to create an accompanying file that complements it. This means that for each small file, I want to create a file that contains the contents of danish.train minus the part that is already in the small file.

What I've come up with so far is this:

trainFile = open("danish.train")

for file_number in range(1,12):
    input = open('danish.test.part-%02d' % file_number, 'r')

    for line in trainFile:
        if line not in input:
            with open('danish.train.part-%02d' % file_number, 'a+') as myfile:
                myfile.write(line)

The problem is that this code only gives output for file_number 1, although I have a loop from 1-11. If I change the range, for example to in range(2,3), I get an output danish.train.part-02, but this output contains a copy of the whole danish.train without leaving out the contents of the file danish.test.part-02, as I wanted.

I suspect that these issues may have something to do with me not completely understanding the with... as operator, but I'm not sure. Any help would be greatly appreciated.

Upvotes: 1

Views: 61

Answers (1)

Michael J. Barber

Reputation: 25042

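When you open a file, you get an iterator over the lines of the file. This is nice, in that it lets you go through the file one line at a time without keeping the whole file in memory at once. In your case, it leads to a problem: the iterator can only be consumed once, so after the first pass through trainFile the inner loop has nothing left to read for file numbers 2 to 11. The test `line not in input` has the same problem, because it scans forward through the test file and consumes it as it goes.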
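To see the exhaustion for yourself, here is a minimal sketch. It uses a temporary three-line file (sample.train is a made-up name, not one of your files) and shows that a second pass over the same file object yields nothing, and that an `in` test on a file object reads forward and does not rewind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for danish.train, with three short lines.
    path = os.path.join(tmp, "sample.train")
    with open(path, "w") as f:
        f.write("a\nb\nc\n")

    with open(path) as f:
        first_pass = list(f)    # reads every line
        second_pass = list(f)   # the iterator is already exhausted
        f.seek(0)               # rewinding makes the lines available again
        third_pass = list(f)

    with open(path) as f:
        # "in" on a file object iterates forward until it finds a match.
        found_b = "b\n" in f    # consumes "a\n" and "b\n"
        found_a = "a\n" in f    # only "c\n" is left, so this is False

print(first_pass, second_pass, third_pass)
print(found_b, found_a)
```

Calling seek(0) before each pass would also work, but every membership test would still rescan the file from the start.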

Instead, you can read the full training file into memory, and go through it multiple times:

with open("danish.train", 'r') as f:
    train_lines = f.readlines()

for file_number in range(1, 12):
    with open("danish.test.part-%02d" % file_number, 'r') as f:
        test_lines = set(f)
    with open("danish.train.part-%02d" % file_number, 'w') as g:
        g.writelines(line for line in train_lines if line not in test_lines)

I've simplified the logic a little bit, as well. If you don't care about the order of the lines, you could also consider reading the training lines into a set, and then just use set operations instead of the generator expression I used in the final line.
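A rough sketch of the set-based variant, using small in-memory lists as made-up stand-ins for the lines of the two files. Note that a set difference drops duplicate lines as well as the original order, whereas the filtering approach keeps both.

```python
# Hypothetical sample data standing in for the lines of the two files.
train_lines = ["line one\n", "line two\n", "line three\n", "line one\n"]
test_lines = ["line two\n"]

# Set difference: every training line that does not occur in the test file.
# Order is lost and the repeated "line one\n" collapses to a single entry.
remaining = set(train_lines) - set(test_lines)

# Filtering against a set keeps the original order and any duplicates.
test_set = set(test_lines)
kept_in_order = [line for line in train_lines if line not in test_set]

# sorted() is used here only to make the printed result deterministic.
print(sorted(remaining))
print(kept_in_order)
```

Either result can be passed to writelines as in the code above. Which one is appropriate depends on whether repeated lines in danish.train matter to you.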

Upvotes: 1
