P Li

Reputation: 5222

How to iterate over two files with different step sizes without loading them into memory in Python?

Let's say I have two files, A and B.

A has 100 lines and B has 10 lines. I need to perform an operation on every 10 lines of A together with 1 line of B.

For example, A contains the lines a1, a2, ..., a10, a11, ..., a20, ..., a100,

and B contains the lines b1, b2, ..., b10.

I would like to run the operation on a1, a2, ..., a10 and b1, then again on a11, a12, ..., a20 and b2, and so on.

The problem is that both A and B are very large, so I cannot load them entirely into memory. I need to iterate over them line by line, but at different speeds, because 10 lines in A map to 1 line in B. How can I do that without preprocessing A so that it has the same number of rows as B?

(I am using Python 2.7.)

Upvotes: 0

Views: 162

Answers (3)

Blckknght

Reputation: 104722

There's a common idiom to iterate over "chunks" using itertools.izip and an unpacked list of references to a single iterator:

from itertools import izip

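# `iterator` must be a single iterator object (e.g. iter(some_list) or an open file)
for values in izip(*[iterator]*n):  # values will hold n items at a time
    pass  # process the chunk here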
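
As a small illustration of why the idiom works: every position in the argument list refers to the same iterator object, so each tuple consumes n consecutive items. The sketch below uses a made-up list of strings in place of a file, and the try/except import is only there so that it also runs on Python 3.

```python
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the builtin zip is already lazy

lines = ["a%d" % i for i in range(1, 8)]  # hypothetical data: a1..a7
n = 3

it = iter(lines)  # a single iterator, referenced n times below
chunks = list(izip(*[it] * n))
# chunks == [('a1', 'a2', 'a3'), ('a4', 'a5', 'a6')]
# The leftover 'a7' is consumed but dropped, because the last tuple is incomplete.
```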

For your purpose, you can add a reference to your smaller file along with multiple references to the larger file (files are iterators):

from itertools import izip

with open("A") as file_a, open("B") as file_b:
    for values in izip(file_b, *[file_a]*10):  # one line of B followed by 10 lines of A
        b_line, a_lines = values[0], values[1:]
        # do stuff here with b_line and a_lines

If the files might not line up exactly, you can use itertools.izip_longest instead of the regular izip. Its fillvalue argument lets you supply a default value to use when one input runs out before the others. In Python 3, the regular builtin zip works like itertools.izip does in Python 2, so you wouldn't need to import anything for the basic case, and izip_longest is named itertools.zip_longest.
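
A minimal sketch of the izip_longest variant, assuming you want each line of B paired with a list of up to n lines of A. The helper name paired_chunks and the sample data are made up for illustration; any iterable of lines (such as an open file) can be passed in.

```python
try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3

def paired_chunks(lines_a, lines_b, n=10, fillvalue=None):
    """Yield (b_line, [a_line, ...]) with n lines of A per line of B.

    Missing lines are replaced by fillvalue when the inputs do not line up.
    """
    it_a = iter(lines_a)  # one shared iterator over A, referenced n times
    for values in izip_longest(lines_b, *[it_a] * n, fillvalue=fillvalue):
        yield values[0], list(values[1:])

# Hypothetical data: 5 lines of A, 3 lines of B, 2 lines of A per line of B.
pairs = list(paired_chunks(["a1", "a2", "a3", "a4", "a5"], ["b1", "b2", "b3"], n=2))
# pairs == [('b1', ['a1', 'a2']), ('b2', ['a3', 'a4']), ('b3', ['a5', None])]
```

Checking for the fill value inside the loop is one way to notice that the files did not match up.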

Upvotes: 1

JRG

Reputation: 4187

You can write some generic code that prints the lines of a.txt and b.txt interleaved, based on how many lines each file has.

You can either derive the interval dynamically, as below, or hard-code it (factor = 10) in the code.

NOTE: For my sample execution, a.txt has a1..a33 (33 lines) and b.txt has b1..b3 (3 lines)

import math

## capture all data from files into a list (note: this reads both files fully into memory)
alines = [line.rstrip('\n') for line in open('a.txt')]
blines = [line.rstrip('\n') for line in open('b.txt')]

## you can have your own custom factor too
## e.g. 10 as you explain in question
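## float() avoids Python 2 integer division; int() keeps factor usable as a whole number
factor = int(math.ceil(float(len(alines)) / len(blines)))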
print("Your factor is: " + str(factor))
print("\n")

acount = 0
bcount = 0

## iterate thru all elements of a and b file
for line in range(len(alines) + len(blines)):
    ## print b only when its turn
    if (bcount < len(blines) and (line % (factor+1)) == 0):
        print(blines[bcount])
        bcount = bcount + 1
    else:
        ## print a when its turn
        if (acount < len(alines)):
            print(alines[acount])
            acount = acount + 1

Sample Run

b1
a1
a2
a3
a4
a5
a6
a7
a8
a9
a10
a11
b2
a12
a13
a14
a15
a16
a17
a18
a19
a20
a21
a22
b3
a23
a24
a25
a26
a27
a28
a29
a30
a31
a32
a33
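
If the interval is known up front (for example factor = 10 or 11), the same interleaving can be produced lazily, without building the two lists first. This is a minimal sketch under that assumption; interleave is a hypothetical helper name and is not part of the code above.

```python
from itertools import islice

def interleave(lines_a, lines_b, factor):
    """Yield one line of B, then up to `factor` lines of A, repeatedly."""
    it_a = iter(lines_a)  # shared iterator so islice continues where it left off
    for b_line in lines_b:
        yield b_line
        for a_line in islice(it_a, factor):
            yield a_line
    # Any lines of A left over after B runs out are emitted at the end.
    for a_line in it_a:
        yield a_line

# Same shape as the sample run: a1..a33, b1..b3, factor 11.
a_lines = ["a%d" % i for i in range(1, 34)]
b_lines = ["b1", "b2", "b3"]
result = list(interleave(a_lines, b_lines, 11))
```

With open files in place of the lists, only one line is held in memory at a time.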

Upvotes: 1

K. Nielson

Reputation: 191

The standard file.readline method reads a single line at a time. Opening a file object does not load the file's contents into memory all at once; reads are buffered and the file position advances as subsequent readline calls are made. In this manner, you could accomplish what you are looking for like so:

def process(a_data, b_data):
    pass  # Your code here

a_data = [None] * 10
with open('pathToFileB', 'r') as fileB:
    with open('pathToFileA', 'r') as fileA:
        for b_data in fileB:
            for i in range(len(a_data)):
                a_data[i] = fileA.readline()  # returns '' once file A is exhausted
            process(a_data, b_data)

Of course, this assumes that file A is guaranteed to have 10 lines for every line in file B. If file A runs out while lines remain in file B, readline returns an empty string at end of file, so a_data will be filled with empty strings rather than real data.
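
Since readline returns '' at end of file, a short file A is easy to miss. One way to detect it is to read each group with itertools.islice and check how many lines actually came back. The sketch below is only an illustration: iterate_pairs is a hypothetical helper, and raising ValueError is one possible policy among several.

```python
from itertools import islice

def iterate_pairs(file_a, file_b, n=10):
    """Yield (a_lines, b_line); raise ValueError if A runs out before B."""
    it_a = iter(file_a)  # for an open file this is the file object itself
    for b_line in file_b:
        a_lines = list(islice(it_a, n))  # reads at most n lines of A
        if len(a_lines) < n:
            raise ValueError(
                "file A ended early: expected %d lines, got %d" % (n, len(a_lines))
            )
        yield a_lines, b_line

# Hypothetical data standing in for the two open files.
pairs = list(iterate_pairs(["a1", "a2", "a3", "a4"], ["b1", "b2"], n=2))
# pairs == [(['a1', 'a2'], 'b1'), (['a3', 'a4'], 'b2')]
```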

Upvotes: 1
