Slow list comparison code between two files

for example I have two files:

file1:

id chrom start     end       strand
g1  11  98566330    98566433    -
g2  11  98566295    98566433    -
g3  11  98566581    98566836    -

file2:

id chrom   start   end      strand  gene_id            gene_name somecol1...somecol10
g1  11  98566330    98566433    -   ENSMUSG00000017210  Med24
g2  11  98566295    98566433    -   ENSMUSG00000017210  Med24
g3  11  98566581    98566836    -   ENSMUSG00000017210  Med24

desired output

id chrom start     end       strand gene_id gene_name somecol1...somecol10
g1 11  98566330    98566433    -   ENSMUSG00000017210 Med24
g2 11  98566295    98566433    -   ENSMUSG00000017210 Med24
g3 11  98566581    98566836    -   ENSMUSG00000017210 Med24

What I am basically trying to do is match the id column between both files and, if there is a match, print/write some columns from file1 and file2 to a new file (my current code):

import os

with open(os.path.expanduser('~/outfile.txt'), 'w') as w:  # open() does not expand '~' by itself
    for id1 in c1:  # c1 is a list where I append each line (split into fields) from file1
        for id2 in d1:  # d1 is a list where I append each line (split into fields) from file2
            if id1[0] in id2[0]:    # is this condition faster (condition1)
            # if id1[0] == id2[0]:  # or is this condition faster (condition2)
                out = ('\t'.join(id2[0:6]), id1[1], id1[2], id2[9], id2[10])
                w.write('\t'.join(out) + '\n')
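For reference, c1 and d1 are just the split lines of each file, built beforehand roughly like this (file names and header skipping are assumptions based on the examples above):

c1, d1 = [], []

with open('file1.txt') as f1:
    next(f1)  # skip the header row
    for line in f1:
        c1.append(line.rstrip('\n').split('\t'))  # one list of fields per row

with open('file2.txt') as f2:
    next(f2)  # skip the header row
    for line in f2:
        d1.append(line.rstrip('\n').split('\t'))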

The issue is that this code works as desired with condition2, but it is very slow, maybe because I am comparing every line of c1 against every line of d1 (id1[0] == id2[0]) and because file2 has roughly ~500000 rows.

Currently I could only come up with the two conditions above, and I am trying to learn which of them might make the code faster.

Is there better logic to use that will increase the speed?

EDIT:

I need to match file1 col0 (id) with file2 col0 (id) and, if they match, slice out elements col0:6, cols 1 and 2 from file1, and cols 9 and 10 from file2.

desired output

id(file2) chrom(file2) start(file2)     end(file2)       strand(file2) gene_id(file2) gene_name(file2) somecol1(file1)...somecol10(file1)
g1 11  98566330    98566433    -   ENSMUSG00000017210 Med24
g2 11  98566295    98566433    -   ENSMUSG00000017210 Med24
g3 11  98566581    98566836    -   ENSMUSG00000017210 Med24

Upvotes: 2

Views: 105

Answers (2)

Jean-François Fabre

Reputation: 140148

If I understand correctly, you want to keep rows from the second file only if the id field is in the first file.

I would use the csv module all the way, which is cleaner.

First, I would build a set of id fields from the file1 contents (the one with only 5 fields per row) for fast lookup.

Then I would read the second file, and write the rows to a third file only if the row id is contained in the set. You'll benefit from the speed of the set lookup:

import csv

with open("file1.txt") as file1:
    cr = csv.reader(file1, delimiter="\t")
    next(cr)  # skip title
    subset = {row[0] for row in cr}  # build a lookup set of ids with a set comprehension

with open("file2.txt") as file2, open("result.txt", "w", newline="") as file3:  # python 2: open("result.txt","wb")
    cr = csv.reader(file2, delimiter="\t")
    cw = csv.writer(file3, delimiter="\t")
    cw.writerow(next(cr))  # write title
    cw.writerows(row for row in cr if row[0] in subset)
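Note that this only filters file2's rows. If you also want to pull columns 1 and 2 from file1 into each output row, as in the question's desired output, one possible variation is to keep whole file1 rows in a dict keyed on id instead of a set; the slice indices below mirror the tuple built in the question's loop and are otherwise an assumption:

import csv

# map id -> full row of file1, so the matching file1 columns can be merged in
with open("file1.txt") as file1:
    cr = csv.reader(file1, delimiter="\t")
    next(cr)  # skip title
    file1_rows = {row[0]: row for row in cr}

with open("file2.txt") as file2, open("result.txt", "w", newline="") as file3:
    cr = csv.reader(file2, delimiter="\t")
    cw = csv.writer(file3, delimiter="\t")
    next(cr)  # skip title (write a combined header here if you need one)
    for row in cr:
        match = file1_rows.get(row[0])  # dict lookup is O(1), like the set
        if match is not None:
            # columns 0:6, 9 and 10 from file2 plus columns 1 and 2 from file1
            cw.writerow(row[0:6] + [match[1], match[2], row[9], row[10]])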

Upvotes: 1

John 9631

Reputation: 577

You need to know where time is being spent to know how best to fix it. At a guess, it's going to be reading and writing files, and everything else will amount to ... nearly nothing.

Optimization is good if you can focus on the problem areas - that way everything else can remain clean, readable Python. So I'd start by profiling your code. cProfile is a good place to start your investigation, and you may need to create some functions to divide your work up so that you can see what is taking time.
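For example, a minimal profiling sketch with cProfile and pstats (process_files is a placeholder for however you wrap your reading/matching/writing):

import cProfile
import pstats

def process_files():
    ...  # placeholder: put the file reading, matching and writing here

cProfile.run("process_files()", "match.prof")   # dump raw stats to a file
stats = pstats.Stats("match.prof")
stats.sort_stats("cumulative").print_stats(10)  # show the 10 most expensive calls

Alternatively, running python -m cProfile -s cumulative yourscript.py profiles the whole script without changing any code.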

I recently did the edX ITMO competitive programming course, where speed was very important. Python I/O was a critical barrier, so both reading and writing were optimized. Writes were done in significant blocks where possible, so you may need to aggregate data before writing. Python's memory-mapped reads were used to speed up reading. To give you an example of the relative ease of using mmap, the commented-out code at the top performs the mmap equivalent of the uncommented readlines below:

    # with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    #     end = mm.size()
    #     while mm.tell() < end:
    #         for l in mm.readline().split():
    #             print(';'.join(l.decode('ascii').split(',')))
    #  with open("out1", 'w') as o:
    with open(filename, 'r') as f:
        for l in f.readlines():
            #  ll.append('.'.join(l.split(',')))
            #  o.writelines(l[13:])
            #  print(l[13:], end='')
            print('.'.join(l.split(',')), end='')
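To illustrate the point about aggregating data before writing, a small sketch in the same spirit as the fragment above (it reuses filename and the out1 path from the commented-out lines): collect the transformed lines in a list and write them with a single writelines call instead of one write per line:

    out_lines = []
    with open(filename, 'r') as f:
        for l in f:
            out_lines.append('.'.join(l.split(',')))  # same transform as above; keeps the trailing newline

    with open("out1", 'w') as o:
        o.writelines(out_lines)  # one bulk write instead of many small ones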

Upvotes: 1
