Reputation: 175
I have two large files. File A looks like:
SNP_A-1780270 rs987435 7 78599583 - C G
SNP_A-1780271 rs345783 15 33395779 - C G
SNP_A-1780272 rs955894 1 189807684 - G T
SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T
... and has 950,000 lines.
File B looks like:
SNP_A-1780274
SNP_A-1780277
SNP_A-1780278
SNP_A-1780283
SNP_A-1780285
SNP_A-1780286
SNP_A-1780287
... and has 900,000 lines.
I need to find the lines of file A whose first column also appears in file B, and write them to an output file like:
SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T
How can I do it in the most efficient way in Python?
Upvotes: 0
Views: 668
Reputation: 1572
If you can invoke join filea fileb > filec from your Python code, it will give you what you are looking for, provided both files are sorted on column 1 (join requires sorted input).
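A rough sketch of calling join from Python follows. The helper name run_join and its arguments are made up for illustration, and it assumes a POSIX system with the coreutils join on the PATH and both inputs already sorted on column 1 in byte order, as the sample data appears to be.

```python
import os
import subprocess


def run_join(path_a, path_b, path_out):
    """Write the lines of path_a whose first field also appears in path_b.

    Hypothetical helper: assumes `join` is installed and that both files
    are sorted on their first field in byte (C locale) order.
    """
    env = dict(os.environ, LC_ALL="C")  # make join compare keys bytewise
    with open(path_out, "w") as out:
        # check=True raises CalledProcessError if join exits with an error
        subprocess.run(["join", path_a, path_b], stdout=out, env=env, check=True)
```

Called as run_join('filea', 'fileb', 'filec'), this should be equivalent to the shell redirection above. If the files are not sorted, sort them first or use one of the pure-Python approaches.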
Upvotes: 0
Reputation: 91029
If File A's lines are long compared to the "key" column 1, you could try this approach:
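positions = {}
# Binary mode, so that len(lineA) is the real number of bytes on disk.
with open('fileA.txt', 'rb') as fA:
    pos = 0
    for lineA in fA:
        uid = lineA.split(b' ')[0]  # e.g. b'SNP_A-1780270'
        positions[uid] = pos        # byte offset where this line starts
        pos += len(lineA)
with open('fileB.txt', 'rb') as fB, open('fileA.txt', 'rb') as fA, open('fileC.txt', 'wb') as out:
    for lineB in fB:
        pos = positions.get(lineB.strip())
        if pos is None:  # key is in file B but not in file A
            continue
        fA.seek(pos)
        out.write(fA.readline())  # the line already ends with a newline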
You should check whether the running pos += ... count or file.tell() is more reliable. I think file.tell() doesn't work here because buffering is involved, but the pos += ... count might need adjusting as well.
This needs less memory than the dict version, but is probably slower, since file A is read once to build the index and then revisited with a seek for every match.
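One way to sidestep the tell() question, sketched under a few assumptions: open the file in binary mode and call readline() explicitly instead of iterating over the file object, so that tell() stays in step with the lines actually consumed. The function names build_offsets and fetch_lines are hypothetical, and keys are handled as bytes.

```python
def build_offsets(path):
    """Map each line's first field to the byte offset where the line starts."""
    offsets = {}
    with open(path, "rb") as f:  # binary mode: offsets are plain byte counts
        while True:
            pos = f.tell()       # position before reading the next line
            line = f.readline()
            if not line:         # empty bytes object means end of file
                break
            fields = line.split(None, 1)
            if fields:           # skip blank lines
                offsets[fields[0]] = pos
    return offsets


def fetch_lines(path, offsets, keys):
    """Yield the stored line for every key that has a known offset."""
    with open(path, "rb") as f:
        for key in keys:
            pos = offsets.get(key)
            if pos is not None:  # keys missing from the index are skipped
                f.seek(pos)
                yield f.readline()
```

The index holds one integer per line of file A rather than the line itself, which is where the memory saving over the dict version comes from.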
Upvotes: 0
Reputation: 14209
I think a dict is ideal:
>>> sa = """SNP_A-1780270 rs987435 7 78599583 - C G
SNP_A-1780271 rs345783 15 33395779 - C G
SNP_A-1780272 rs955894 1 189807684 - G T
SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T"""
>>> dict_lines = {}
>>> for line in sa.split('\n'):
...     dict_lines[line.split()[0]] = line
...
>>> sb = """SNP_A-1780274
SNP_A-1780277
SNP_A-1780278
SNP_A-1780283
SNP_A-1780285
SNP_A-1780286
SNP_A-1780287"""
>>> for val in sb.split('\n'):
...     line = dict_lines.get(val, None)
...     if line:
...         print(line)
...
SNP_A-1780274 rs6088791 20 33907909 - A G
SNP_A-1780277 rs11180435 12 75664046 + C T
SNP_A-1780278 rs17571465 1 218890658 - A T
SNP_A-1780283 rs17011450 4 127630276 - C T
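The session above works on in-memory strings. Applied to the actual files, a variation on the same lookup idea might look like the sketch below. Note that it swaps the dict of file A lines for a set of file B keys and streams file A once, so only the keys are held in memory and the output keeps file A's order. The function name filter_by_keys and the file names in the usage note are assumptions.

```python
def filter_by_keys(path_a, path_b, path_out):
    """Copy the lines of path_a whose first column is listed in path_b.

    Returns the number of lines written.
    """
    with open(path_b) as fb:
        # One key per line; blank lines are ignored.
        wanted = {line.strip() for line in fb if line.strip()}

    kept = 0
    with open(path_a) as fa, open(path_out, "w") as out:
        for line in fa:
            fields = line.split(None, 1)  # split off the first column only
            if fields and fields[0] in wanted:
                out.write(line)
                kept += 1
    return kept
```

For example, filter_by_keys('fileA.txt', 'fileB.txt', 'fileC.txt') would write the matching lines to fileC.txt. Set membership tests are constant time on average, so the whole run is roughly one pass over each file.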
Upvotes: 2