Reputation: 704
for example I have two files:
file1:
id chrom start end strand
g1 11 98566330 98566433 -
g2 11 98566295 98566433 -
g3 11 98566581 98566836 -
file2
id chrom start end strand gene_id gene_name somecol1...somecol10
g1 11 98566330 98566433 - ENSMUSG00000017210 Med24
g2 11 98566295 98566433 - ENSMUSG00000017210 Med24
g3 11 98566581 98566836 - ENSMUSG00000017210 Med24
desired output
id chrom start end strand gene_id gene_name somecol1...somecol10
g1 11 98566330 98566433 - ENSMUSG00000017210 Med24
g2 11 98566295 98566433 - ENSMUSG00000017210 Med24
g3 11 98566581 98566836 - ENSMUSG00000017210 Med24
What I am basically trying to do is match the id column from both files and, if there is a match, print/write some columns from file1 and file2 to a new file (my current code):
with open('~/outfile.txt', 'w') as w:
    for id1 in c1:  # c1 is a list where I append each line from file1
        for id2 in d1:  # d1 is a list where I append each line from file2
            if id1[0] in id2[0]:  # is this condition faster (condition1)
            # if id1[0] == id2[0]:  # or is this condition faster (condition2)
                out = ('\t'.join(id2[0:6]), id1[1], id1[2], id2[9], id2[10])
                w.write('\t'.join(out) + '\n')
The issue is that this code works as desired with condition2, but it is very slow, maybe because I am matching each line (id1[0] == id2[0]) between both lists c1 and d1, and also because file2 has ~500000 rows.
Currently these two conditions are the only ones I could come up with, and I am trying to learn which might make the code faster.
Is there better logic to use that will increase the speed?
EDIT:
I need to match file1 col0 (id) with file2 col0 (id); if they match, take cols 0:6 from file2, cols 1,2 from file1, and cols 9,10 from file2.
desired output
id(file2) chrom(file2) start(file2) end(file2) strand(file2) gene_id(file2) gene_name(file2) somecol1(file1)...somecol10(file1)
g1 11 98566330 98566433 - ENSMUSG00000017210 Med24
g2 11 98566295 98566433 - ENSMUSG00000017210 Med24
g3 11 98566581 98566836 - ENSMUSG00000017210 Med24
Upvotes: 2
Views: 105
Reputation: 140148
If I understand correctly, you want to keep rows from the second file only if the id field is in the first file.
I would use the csv module all the way, which is cleaner.
First I would build a set of id fields for fast lookup from the file1 contents (the one with only 5 fields per row).
Then I would read the second file and write its rows to a third file only if the row id is contained in the set. You'll benefit from the O(1) speed of the set lookup:
import csv

with open("file1.txt") as file1:
    cr = csv.reader(file1, delimiter="\t")
    next(cr)  # skip title
    subset = {row[0] for row in cr}  # build a lookup set of ids in a set comprehension

with open("file2.txt") as file2, open("result.txt", "w", newline="") as file3:  # python 2: open("result.txt", "wb")
    cr = csv.reader(file2, delimiter="\t")
    cw = csv.writer(file3, delimiter="\t")
    cw.writerow(next(cr))  # write title
    cw.writerows(row for row in cr if row[0] in subset)
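The code above only filters file2; to also pull columns from file1 into the output, as the question's EDIT asks, the set can become a dict keyed on id. A minimal sketch, assuming tab-delimited files with headers and the column slices from the question's code (file2 cols 0:6 plus file1 cols 1,2; `merge_on_id` is an illustrative name, and the slices should be adjusted to the real files):

```python
import csv

def merge_on_id(file1_path, file2_path, out_path):
    """Join file2 rows to file1 rows on the first (id) column."""
    with open(file1_path) as f1:
        cr = csv.reader(f1, delimiter="\t")
        header1 = next(cr)
        # build the dict once, then each file2 row costs one O(1) lookup
        lookup = {row[0]: row for row in cr}

    with open(file2_path) as f2, open(out_path, "w", newline="") as out:
        cr = csv.reader(f2, delimiter="\t")
        cw = csv.writer(out, delimiter="\t")
        header2 = next(cr)
        cw.writerow(header2[0:6] + header1[1:3])  # combined header
        for row in cr:
            match = lookup.get(row[0])
            if match is not None:
                # file2 cols 0:6 followed by file1 cols 1,2,
                # mirroring the slices in the question's code
                cw.writerow(row[0:6] + match[1:3])
```

This keeps the single pass over the big file while replacing the quadratic nested loop with dict lookups.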
Upvotes: 1
Reputation: 577
You need to know where time is being spent to know how best to fix it. At a guess it's going to be reading and writing files, and everything else will amount to nearly nothing.
Optimization is good if you can focus on the problem areas - that way everything else can remain clean, readable Python. So I'd start by profiling your code. cProfile is a good place to start your investigation and, possibly, you may need to split your work into functions so that you can see what is taking the time.
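As an illustration, cProfile can be driven from code like this (`work()` is a hypothetical stand-in for the matching loop you would actually profile):

```python
import cProfile
import io
import pstats

def work():
    # hypothetical stand-in for the file-matching loop being profiled
    return sum(i * i for i in range(100000))

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# report the five most expensive calls by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```

Alternatively, `python -m cProfile -s cumulative yourscript.py` profiles the whole script without code changes.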
I recently did the edX ITMO competitive programming course and speed was very important. Python I/O was a critical barrier so both reading and writing were optimized. Writes were done with significant blocks where possible so you may need to aggregate data before writing. Python's memory mapped reads were used to speed up reading. To give you an example of the relative ease of using mmap, the commented out code at the top performs the mmap equivalent of the uncommented readlines below:
import mmap  # needed for the mmap variant below

# with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
#     end = mm.size()
#     while mm.tell() < end:
#         for l in mm.readline().split():
#             print(';'.join(l.decode('ascii').split(',')))
# with open("out1", 'w') as o:
with open(filename, 'r') as f:
    for l in f.readlines():
        # ll.append('.'.join(l.split(',')))
        # o.writelines(l[13:])
        # print(l[13:], end='')
        print('.'.join(l.split(',')), end='')
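The "significant blocks" point above can be sketched like this - buffering formatted lines and flushing them in large chunks rather than calling write once per line (`write_batched` and its parameters are illustrative names, not part of any library):

```python
def write_batched(path, lines, batch_size=10000):
    """Aggregate output lines and flush them in large chunks."""
    with open(path, "w") as out:
        buf = []
        for line in lines:
            buf.append(line)
            if len(buf) >= batch_size:
                # one big write call instead of batch_size small ones
                out.write("\n".join(buf) + "\n")
                buf.clear()
        if buf:  # flush the final partial batch
            out.write("\n".join(buf) + "\n")
```

Whether this beats Python's own buffered I/O is workload-dependent, which is another reason to profile first.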
Upvotes: 1