Reputation: 87
I have two files as shown below:
File 1 (tab delimited):
A1 someinfo1 someinfo2 someinfo3
A1 someinfo1 someinfo2 someinfo3
B1 someinfo1 someinfo2 someinfo3
B1 someinfo1 someinfo2 someinfo3
File 2 (tab delimited):
A1 newinfo1 newinfo2 newinfo3
A1 newinfo1 newinfo2 newinfo3
B1 newinfo1 newinfo2 newinfo3
B1 newinfo1 newinfo2 newinfo3
I want to read two lines together (the two lines starting with A1) from File 1 and the matching two lines (also starting with A1) from File 2. To be clearer, I have two requirements:
1) Read two consecutive lines from the same file.
2) Read the same two lines from the other file.
In other words, I want to read four lines at a time: two consecutive lines from each of the two files.
I searched online and found code that reads two lines together, but only from one file:
with open(File1) as file1:
    for line1, line2 in itertools.izip_longest(*[file1]*2):
I was also able to read one line from each of the two files:
for i, (line1, line2) in enumerate(itertools.izip(f1, f2)):
    print line1, line2
But I want to do something like:
Pseudocode:
for line1, line2 from file1 and line_1, line_2 from file2:
    compare line1 with line2
    compare line1 with line_1
    compare line2 with line_1
    compare line2 with line_2
I am hoping for a linear-time solution. Both files have the same number of lines, the first column (the primary id) is the same for each pair of consecutive lines within a file, and the other file follows the same order (see the example above).
Thanks.
Upvotes: 3
Views: 3263
Reputation: 304473
>>> from itertools import izip
>>> with open("file1") as file1, open("file2") as file2:
...     for a1, a2, b1, b2 in izip(file1, file1, file2, file2):
...         print a1, a2, b1, b2
...
A1 someinfo1 someinfo2 someinfo3
A1 someinfo1 someinfo2 someinfo3
A1 newinfo1 newinfo2 newinfo3
A1 newinfo1 newinfo2 newinfo3
B1 someinfo1 someinfo2 someinfo3
B1 someinfo1 someinfo2 someinfo3
B1 newinfo1 newinfo2 newinfo3
B1 newinfo1 newinfo2 newinfo3
You can make the number of lines a parameter (n) like this:
for lines in izip(*[file1]*n + [file2]*n):
Now lines will be a tuple with n*2 elements.
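For readers on Python 3, where izip no longer exists and the built-in zip is already lazy, the parameterised version might look like the sketch below. The helper name read_in_lockstep and the in-memory io.StringIO stand-ins for the two files are assumptions made for illustration, not part of the original answer.

```python
import io


def read_in_lockstep(file1, file2, n=2):
    """Yield tuples of n lines from file1 followed by n lines from file2."""
    # Repeating the *same* iterator n times makes zip pull n consecutive
    # lines from it for every tuple it builds.
    return zip(*([file1] * n + [file2] * n))


# In-memory stand-ins for the two tab-delimited files (illustrative data).
file1 = io.StringIO("A1\tsome1\nA1\tsome2\nB1\tsome3\nB1\tsome4\n")
file2 = io.StringIO("A1\tnew1\nA1\tnew2\nB1\tnew3\nB1\tnew4\n")

groups = [tuple(line.rstrip("\n") for line in lines)
          for lines in read_in_lockstep(file1, file2, n=2)]

for group in groups:
    print(group)
```

Note that zip stops at the first exhausted iterator, so a trailing group with fewer than n lines is silently dropped. That is harmless under the question's assumption that both files have the same, even number of lines.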
Upvotes: 1
Reputation: 414875
Here's a generalization that allows any number of consecutive lines with the same id column:
from itertools import groupby, izip, product
getid = lambda line: line.partition(" ")[0] # first space-separated column
same_id = lambda lines: groupby(lines, key=getid)
with open(File1) as file1, open(File2) as file2:
    for (id1, lines1), (id2, lines2) in izip(same_id(file1), same_id(file2)):
        if id1 != id2:
            # handle error here
            break
        # compare all possible combinations
        for a, b in product(lines1, lines2):
            compare(a, b)
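A rough Python 3 rendering of the same idea is sketched below. It assumes the id is the first tab-separated column (the question says the files are tab delimited), raises on an id mismatch instead of breaking, and collects the pairs in a list where the real compare call would go. The names get_id and compare_files, and the sample data, are hypothetical.

```python
import io
from itertools import groupby, product


def get_id(line):
    # First tab-separated column; assumed to be the primary id.
    return line.partition("\t")[0]


def compare_files(file1, file2):
    """Return every (id, line_from_file1, line_from_file2) combination."""
    results = []
    groups1 = groupby(file1, key=get_id)
    groups2 = groupby(file2, key=get_id)
    for (id1, lines1), (id2, lines2) in zip(groups1, groups2):
        if id1 != id2:
            raise ValueError("id mismatch: %r != %r" % (id1, id2))
        # product() consumes both groups before the next zip step,
        # so the lazily evaluated groupby groups are not invalidated.
        for a, b in product(lines1, lines2):
            results.append((id1, a.rstrip("\n"), b.rstrip("\n")))
    return results


# Groups of different sizes per id, to show the generalization.
file1 = io.StringIO("A1\tx1\nA1\tx2\nB1\tx3\n")
file2 = io.StringIO("A1\ty1\nB1\ty2\nB1\ty3\n")
results = compare_files(file1, file2)
```

Each line is still read exactly once, but the number of comparisons per id is the product of the two group sizes, so the total work is linear only while the groups stay small.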
Upvotes: 0
Reputation: 366133
Let's see how we can put these together. First:
with open(File1) as file1:
    for line1, line2 in itertools.izip_longest(*[file1]*2):
Well, take out the for loop and you've got a 2-line-at-a-time iterator over file, right? So, you can do the same for file2. And then you can zip them together:
with open(File1) as file1, open(File2) as file2:
    f1 = itertools.izip_longest(*[file1]*2)
    f2 = itertools.izip_longest(*[file2]*2)
    for i, ((f1_line1, f1_line2), (f2_line1, f2_line2)) in enumerate(itertools.izip(f1, f2)):
        # do stuff
But you really don't want to do this.
First, most people don't intuitively read izip_longest(*[file1]*2) and realize that it's grouping by pairs. Wrap that up as a function. In fact, don't even write the function yourself; take grouper right out of the itertools documentation.
So now, it's:
with open(File1) as file1, open(File2) as file2:
    pairs1 = grouper(2, file1)
    pairs2 = grouper(2, file2)
    for i, ((f1_line1, f1_line2), (f2_line1, f2_line2)) in enumerate(itertools.izip(pairs1, pairs2)):
        # do stuff
Next, pattern-matching may be cool, but a nested pattern to decompose right in the middle of a complicated expression is a little too much. So, let's break it up, and un-nest things by borrowing flatten from the itertools docs again:
with open(File1) as file1, open(File2) as file2:
    pairs1 = grouper(2, file1)
    pairs2 = grouper(2, file2)
    zipped_pairs = itertools.izip(pairs1, pairs2)
    for i, zipped_pair in enumerate(zipped_pairs):
        f1_line1, f1_line2, f2_line1, f2_line2 = flatten(zipped_pair)
        # do stuff
The advantage of this solution is that it's abstract and generic, which means if you later decide you need groups of 5 lines, or 3 files, the change is obvious.
The disadvantage of this solution is that it's abstract and generic, which means it can't possibly be as simple as doing the concrete equivalent. (For example, if you didn't zip up a pair of groupers, you wouldn't have to flatten the result.)
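Since the snippets above use grouper and flatten without defining them, here is a self-contained Python 3 sketch with both helpers written out. They follow the shape of the itertools documentation recipes, but the argument order grouper(n, iterable) is chosen to match the calls in this answer (newer versions of the docs put the iterable first), and the io.StringIO objects are stand-ins for the real files.

```python
import io
from itertools import chain, zip_longest


def grouper(n, iterable, fillvalue=None):
    """Collect items into fixed-length tuples, padding the last with fillvalue."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def flatten(list_of_lists):
    """Flatten one level of nesting."""
    return chain.from_iterable(list_of_lists)


# In-memory stand-ins for File1 and File2 (illustrative data).
file1 = io.StringIO("A1\ta\nA1\tb\nB1\tc\nB1\td\n")
file2 = io.StringIO("A1\tw\nA1\tx\nB1\ty\nB1\tz\n")

rows = []
zipped_pairs = zip(grouper(2, file1), grouper(2, file2))
for i, zipped_pair in enumerate(zipped_pairs):
    f1_line1, f1_line2, f2_line1, f2_line2 = flatten(zipped_pair)
    # "do stuff" placeholder: record the four lines without their newlines
    rows.append((i, f1_line1.rstrip("\n"), f1_line2.rstrip("\n"),
                 f2_line1.rstrip("\n"), f2_line2.rstrip("\n")))
```

With an odd number of lines, grouper pads the final pair with None, so real code would need to check for that before comparing.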
Upvotes: 1
Reputation: 62948
How about this:
with open('a') as A, open('b') as B:
    while True:
        try:
            lineA1, lineA2, lineB1, lineB2 = next(A), next(A), next(B), next(B)
            # compare lines
            # ...
        except StopIteration:
            break
Upvotes: 6