Reputation: 2047
I have two tab-delimited files, and I need to test every row in the first file against all the rows in the other file. For instance,
file1:
row1 c1 36 345 A
row2 c3 36 9949 B
row3 c4 36 858 C
file2:
row1 c1 3455 3800
row2 c3 6784 7843
row3 c3 10564 99302
row4 c5 1405 1563
Let's say I would like to output all the rows in file1 for which col[3] of file1 is smaller than at least one (not necessarily every) col[2] of file2, considering only the rows of file2 whose col[1] is the same.
Expected output:
row1 c1 36 345 A
row2 c3 36 9949 B
Since I am working in Ubuntu, I would like the input command to look like this:
python code.py [file1] [file2] > [output]
I wrote the following code:
    import sys
    filename1 = sys.argv[1]
    filename2 = sys.argv[2]
    file1 = open(filename1, 'r')
    file2 = open(filename2, 'r')
    done = False
    for x in file1.readlines():
        col = x.strip().split()
        for y in file2.readlines():
            col2 = y.strip().split()
            if col[1] == col2[1] and col[3] < col2[2]:
                done = True
                break
            else: continue
    print x
However, the output looks like this:
row2 c3 36 9949 B
This is more evident with larger datasets, but basically I always get only the last row for which the condition in the nested loop was true. I suspect that "break" is breaking me out of both loops. I would like to know (1) how to break out of only one of the for loops, and (2) whether this is the only problem I have here.
Upvotes: 25
Views: 114221
Reputation: 41
You need to convert the numeric strings to their integer values before comparing them; otherwise they are compared as strings, character by character. You can use int(), e.g. int('345'), as follows.
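    import sys
    filename1 = sys.argv[1]
    filename2 = sys.argv[2]
    with open(filename1) as file1:
        for x in file1:
            # Reopen file2 for every row of file1 so it is read from the start
            with open(filename2) as file2:
                col = x.strip().split()
                for y in file2:
                    col2 = y.strip().split()
                    # Compare as integers, not as strings
                    if col[1] == col2[1] and int(col[3]) < int(col2[2]):
                        # x still ends with a newline; strip it to avoid blank lines
                        print x.rstrip('\n')
                        break  # one match is enough, so print each row only once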
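If file2 fits in memory, it may be simpler to read it only once. "Smaller than any col[2]" for a given col[1] is the same as "smaller than the largest col[2]" for that col[1], so it is enough to remember one number per key. This is only a sketch: the function name rows_below_max is made up, io.StringIO stands in for the opened files, and it is written in Python 3 syntax rather than the Python 2 used above.

```python
import io


def rows_below_max(file1, file2):
    """Return file1 rows whose col[3] is below the largest col[2] for the same col[1]."""
    # Largest col[2] seen in file2 for each col[1] key.
    largest = {}
    for y in file2:
        col2 = y.split()
        value = int(col2[2])
        if col2[1] not in largest or value > largest[col2[1]]:
            largest[col2[1]] = value

    # Keep the file1 rows that fall below that maximum.
    kept = []
    for x in file1:
        col = x.split()
        if col[1] in largest and int(col[3]) < largest[col[1]]:
            kept.append(x.rstrip("\n"))
    return kept


# Sample data from the question.
sample1 = io.StringIO("row1\tc1\t36\t345\tA\nrow2\tc3\t36\t9949\tB\nrow3\tc4\t36\t858\tC\n")
sample2 = io.StringIO("row1\tc1\t3455\t3800\nrow2\tc3\t6784\t7843\n"
                      "row3\tc3\t10564\t99302\nrow4\tc5\t1405\t1563\n")
kept_rows = rows_below_max(sample1, sample2)
print("\n".join(kept_rows))
```

With real files, the two io.StringIO objects would be replaced by open(filename1) and open(filename2).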
Upvotes: 4
Reputation: 500317
break and continue apply to the innermost loop.
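A small sketch of that behaviour, using made-up loop values rather than the files from the question (Python 3 syntax): the break leaves the inner loop only, and the outer loop carries on with its next item.

```python
# Illustrative values only; not taken from the question's files.
visited = []
for outer in ["a", "b"]:
    for inner in [1, 2, 3]:
        if inner == 2:
            break  # leaves the inner loop only
        visited.append((outer, inner))
    # Execution resumes here, then the outer loop moves on to its next item.

print(visited)
```

Both "a" and "b" show up in visited, so the outer loop was not interrupted by the break.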
The issue is that you open the second file only once, and therefore it's only read once. When you execute for y in file2.readlines(): for the second time, the file position is already at the end of the file, so file2.readlines() returns an empty list and the inner loop body never runs.
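This is easy to check in isolation. In the sketch below, io.StringIO stands in for the opened file2 (an assumption made so the snippet runs without any files on disk); a real file object behaves the same way here.

```python
import io

# Two sample rows from the question's file2; io.StringIO is a stand-in for open().
file2 = io.StringIO("row1\tc1\t3455\t3800\nrow2\tc3\t6784\t7843\n")

first_read = file2.readlines()   # consumes the whole file
second_read = file2.readlines()  # position is already at the end

print(len(first_read), len(second_read))
```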
Either move file2 = open(filename2, 'r') into the outer loop, or use seek() to rewind to the beginning of file2.
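A minimal sketch of the seek() variant, assuming Python 3 syntax. The function name matching_rows is made up for illustration, io.StringIO stands in for the two opened files, and the int() conversion is included so that the columns compare as numbers rather than as strings.

```python
import io


def matching_rows(file1, file2):
    """Return file1 rows whose col[3] is below some col[2] in file2 with the same col[1]."""
    matches = []
    for x in file1:
        col = x.split()
        file2.seek(0)  # rewind so the inner loop sees every row again
        for y in file2:
            col2 = y.split()
            # int() so that the values compare as numbers, not as strings
            if col[1] == col2[1] and int(col[3]) < int(col2[2]):
                matches.append(x.rstrip("\n"))
                break  # one match is enough; this stops the inner loop only
    return matches


# Sample data from the question.
file1 = io.StringIO("row1\tc1\t36\t345\tA\nrow2\tc3\t36\t9949\tB\nrow3\tc4\t36\t858\tC\n")
file2 = io.StringIO("row1\tc1\t3455\t3800\nrow2\tc3\t6784\t7843\n"
                    "row3\tc3\t10564\t99302\nrow4\tc5\t1405\t1563\n")
result = matching_rows(file1, file2)
print("\n".join(result))
```

In a script, file1 and file2 would come from open(sys.argv[1]) and open(sys.argv[2]) instead.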
Upvotes: 43