Reputation: 5222
Let's say I have two files, A and B.
A has 100 lines and B has 10 lines. I need to perform an operation on every 10 lines of A together with every 1 line of B.
For example, A contains the lines a1 a2 ... a10 a11 ... a20 ... a100,
and B contains the lines b1 b2 ... b10.
I would like to perform the operation on a1, a2 ... a10 and b1, then again on a11, a12 ... a20 and b2, and so on.
The problem is that both A and B are huge, so I cannot load them entirely into memory. I need to iterate over them line by line, but at different speeds, because 10 lines in A map to 1 line in B. How can I do that without preprocessing A so that it has the same number of rows as B?
(I use Python 2.7.)
Upvotes: 0
Views: 162
Reputation: 104722
There's a common idiom for iterating over "chunks": pass itertools.izip an unpacked list of n references to a single iterator:
from itertools import izip
it = iter(iterable)  # one shared iterator; iterable is whatever you are chunking
for values in izip(*[it] * n):
    pass  # values will hold n values at a time
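As a small illustration of the idiom on an in-memory list, here is a minimal sketch. The names chunks and data are made up for this example, and the import fallback is only there so the snippet also runs on Python 3, where the builtin zip is already lazy:

```python
try:
    from itertools import izip  # Python 2
except ImportError:
    izip = zip  # Python 3: the builtin zip is already lazy

def chunks(iterable, n):
    """Return an iterator of n-tuples; a short trailing group is dropped."""
    it = iter(iterable)  # one shared iterator, referenced n times below
    return izip(*[it] * n)

# Hypothetical sample data standing in for lines of a file.
data = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
result = list(chunks(data, 3))
print(result)  # [('a1', 'a2', 'a3'), ('a4', 'a5', 'a6')]
```

Because every position in the call pulls from the same iterator, each tuple consumes n consecutive items. The leftover 'a7' is discarded, since izip stops as soon as any position runs out.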
For your purpose, you can add a reference to your smaller file along with multiple references to the larger file (files are iterators):
with open("A") as file_a, open("B") as file_b:
    for values in izip(file_b, *[file_a]*10): # values will have one B value and 10 A values
        pass # do stuff here with the values
If the files might not line up exactly, you can use itertools.izip_longest instead of the regular izip. It lets you supply a default value (the fillvalue argument) to use when the inputs don't match up exactly. In Python 3, the builtin zip works like itertools.izip does in Python 2, so you wouldn't need to import anything for the basic case (izip_longest is named itertools.zip_longest there).
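To make the mismatched case concrete, here is a rough sketch of the izip_longest variant. The helper name paired_chunks is invented for this example, io.StringIO objects stand in for the real files, and the import fallback is an assumption added so the snippet runs on both Python 2 and Python 3:

```python
import io
try:
    from itertools import izip_longest  # Python 2
except ImportError:
    from itertools import zip_longest as izip_longest  # Python 3 name

def paired_chunks(file_a, file_b, n=10, fillvalue=None):
    """Yield (b_line, [n lines of A]); missing lines become fillvalue."""
    for values in izip_longest(file_b, *[file_a] * n, fillvalue=fillvalue):
        yield values[0], list(values[1:])

# In-memory stand-ins for the real files: A has 5 lines, B has 3.
file_a = io.StringIO(u"a1\na2\na3\na4\na5\n")
file_b = io.StringIO(u"b1\nb2\nb3\n")
pairs = list(paired_chunks(file_a, file_b, n=2))
for b_line, a_lines in pairs:
    print(repr(b_line), a_lines)
```

Here the last B line is paired with only one real line of A, and the missing slot is filled with None. Lines keep their trailing newline, as they would when read from a real file.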
Upvotes: 1
Reputation: 4187
You can write some generic code that prints lines from the files a.txt and b.txt depending on how many lines they have.
You can either derive the interval dynamically, as below, or hard-code it (factor = 10) in the code below.
NOTE: For my sample execution, a.txt has a1..a33 (33 lines) and b.txt has b1..b3 (3 lines).
import math
## capture all data from files into lists (note: this reads both files fully into memory)
alines = [line.rstrip('\n') for line in open('a.txt')]
blines = [line.rstrip('\n') for line in open('b.txt')]
## you can have your own custom factor too
## e.g. 10 as you explain in question
## float() avoids Python 2 integer division; int() keeps factor usable with %
factor = int(math.ceil(float(len(alines)) / len(blines)))
print("Your factor is: " + str(factor))
print("\n")
acount = 0
bcount = 0
## iterate thru all elements of a and b file
for line in range(len(alines) + len(blines)):
    ## print b only when its turn
    if (bcount < len(blines) and (line % (factor+1)) == 0):
        print(blines[bcount])
        bcount = bcount + 1
    else:
        ## print a when its turn
        if (acount < len(alines)):
            print(alines[acount])
            acount = acount + 1
Sample Run
b1
a1
a2
a3
a4
a5
a6
a7
a8
a9
a10
a11
b2
a12
a13
a14
a15
a16
a17
a18
a19
a20
a21
a22
b3
a23
a24
a25
a26
a27
a28
a29
a30
a31
a32
a33
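If the files are too large to hold in lists, the factor can still be derived by counting lines in a streaming pass. A rough sketch, where count_lines and derive_factor are names invented for this example and io.StringIO objects stand in for a.txt and b.txt:

```python
import io

def count_lines(fileobj):
    """Count lines by iterating, so only one line is held at a time."""
    return sum(1 for _ in fileobj)

def derive_factor(a_count, b_count):
    """Ceiling of a_count / b_count using integer arithmetic only."""
    return -(-a_count // b_count)

# Stand-ins shaped like the sample run: 33 lines in A, 3 lines in B.
file_a = io.StringIO(u"".join(u"a%d\n" % i for i in range(1, 34)))
file_b = io.StringIO(u"".join(u"b%d\n" % i for i in range(1, 4)))
factor = derive_factor(count_lines(file_a), count_lines(file_b))
print(factor)  # 11
```

With real files, the counting pass consumes the file object, so it would need to be reopened (or rewound with seek(0)) before the processing pass.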
Upvotes: 1
Reputation: 191
The standard file.readline method will give you your single-line reads. Opening a file object does not load the file into memory all at once; the file is buffered, and the file position advances as subsequent readline calls are made. In this manner, you could accomplish what you are looking for like so:
def process(a_data, b_data):
    pass # Your code here
a_data = [None] * 10
with open('pathToFileB', 'r') as fileB:
    with open('pathToFileA', 'r') as fileA:
        for b_data in fileB:
            # read the next 10 lines of A for this line of B
            for i in range(len(a_data)):
                a_data[i] = fileA.readline()
            process(a_data, b_data)
Of course, this assumes that file A is assured to have 10 lines for every line in file B. If file A reaches its end while there are still lines left in file B, a_data will be filled with empty strings, because readline returns '' at end of file.
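If file A may run short, one alternative to the readline loop is itertools.islice, which returns a shorter list instead of padding. This is a minimal sketch of that swapped-in technique, not the code above: iterate_together is a hypothetical helper name, process is a placeholder that just echoes its input, and io.StringIO objects stand in for the real files:

```python
import io
from itertools import islice

def process(a_data, b_data):
    # Placeholder for the real work: it only reports what it received.
    return (b_data.strip(), [line.strip() for line in a_data])

def iterate_together(file_a, file_b, n=10):
    """Yield process(...) for each B line and the next n lines of A."""
    for b_data in file_b:
        a_data = list(islice(file_a, n))  # at most n lines; fewer at end of A
        yield process(a_data, b_data)

# In-memory stand-ins for the real files: A has 5 lines, B has 3.
file_a = io.StringIO(u"a1\na2\na3\na4\na5\n")
file_b = io.StringIO(u"b1\nb2\nb3\n")
results = list(iterate_together(file_a, file_b, n=2))
for b_value, a_values in results:
    print(b_value, a_values)
```

Only n lines of A and one line of B are held at a time, and a short final chunk shows up as a shorter list that process can detect by its length.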
Upvotes: 1