Chubaka

Reputation: 3155

Improve efficiency in Python matching

Given the following input, we would like to keep rows that share the same "APP_ID" column (the 4th column) and whose "Category" column (the 18th column) values are one "Cell" and one "Biochemical", or one "Cell" and one "Enzyme".

A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-2 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,

The ideal output will be

A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,

"APP-1" is kept because their column 3 are the same and their Category are one "Cell" and the other one is "Enzyme". The same thing for "APP-3", which has one "Cell" and the other one is "Biochemical" in its "Category" column.

The following attempt could do the trick:

import os

App=["1"]

for a in App:
    outname="App_"+a+"_target_overlap.csv"
    out=open(outname,'w')
    ticker=0
    cell_comp_id=[]
    final_comp_id=[]

    # make compound with cell activity (to a target) list first

    filename="App_"+a+"_target_Detail_average.csv"
    if os.path.exists(filename):
        file = open (filename)
        line=file.readlines()
        if(ticker==0): # Deal with the title
            out.write(line[0])
            ticker=ticker+1

            for c in line[1:]:
                c=c.split(',')
                if(c[17]==" Cell "):
                     cell_comp_id.append(c[3])
    else:
        cell_comp_id=list(set(cell_comp_id))

# Now that we have the list of compounds with cell activity, search for the Biochemical and Enzyme rows and build the final compound list

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for comp in cell_comp_id:
                if (c[3]==comp and c[17]==" Biochemical "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                elif (c[3]==comp and c[17]==" Enzyme "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
    else:
        final_comp_id=list(set(final_comp_id))

# After we obtain the final compound list for target a, go through the whole csv again to output the cell rows

    filename="App_"+a+"_target_Detail_average.csv"

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for final in final_comp_id:
                if (c[3]==final and c[17]==" Cell "):
                    out.write(str(temporary_line))

    out.close()

When the input file is small (tens of thousands of lines), this script finishes in reasonable time. However, when the input files grow to millions or billions of lines, the script takes forever (days...). I think the issue is that we first build a list of APP_IDs whose 18th column is "Cell", then compare that "Cell" list (maybe half a million entries) against the whole file (one million lines, for example): if an APP_ID appears both in the Cell list and in a row of the whole file whose 18th column is "Enzyme" or "Biochemical", we keep that row. This step seems to be very time consuming.

I am thinking that building "Cell", "Enzyme" and "Biochemical" dictionaries and comparing those might be faster? Does anyone have a better way to process this? Any example/comment would be helpful. Thanks.

Upvotes: 0

Views: 94

Answers (2)

Daniel

Reputation: 1420

reading the file(s) efficiently

One big problem is that you're reading the whole file at once using readlines. This requires loading it ALL into memory in one go. I doubt you have that much memory available.

Try:

with open(filename) as fh:
    out.write(fh.readline()) # ticker
    for line in fh: #iterate through lines 'lazily', reading as you go.
        c = line.split(',')

style of code to start with. This should help a lot. Here it is in context:

# make compound with cell activity (to a target) list first

if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline()) # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])

The with open(...) as syntax is a very common Python idiom which automatically handles closing the file when you leave the with block, or if there is an error. Very useful.
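For comparison, here is roughly what that with block does for you behind the scenes (just a sketch; the loop body is a placeholder):

fh = open(filename)
try:
    for line in fh:
        pass  # process each line here
finally:
    fh.close()  # always runs, even if an exception was raised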

sets

Next thing is, as you suggest, using sets a little better.

You don't need to recreate the set each time; you can just update it to add items. Here's some example set code (written in the Python interpreter style; >>> at the beginning means it's a line of stuff to type - don't actually type the >>> bit!):

>>> my_set = set()
>>> my_set
set()

>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])

>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])

>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])

>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])

So you can add items to a set, and remove them, without creating a new set from scratch (which you are doing each time with the cell_comp_id=list(set(cell_comp_id)) bit).

You can also get differences, intersections, etc:

>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])

>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])

See the docs for more info.

So let's try something like:

cells = set()
enzymes = set()
biochemicals = set()

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)

Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:

cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals

You can then go through all the files again and simply check whether row_id (or c[3]) is in either of those sets, and if so, print the line.

You can actually combine those two sets even further:

cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals

which gives the IDs that have a Cell row and either an Enzyme or a Biochemical row.

So then when you run through the files the second time, you can do:

if row_id in cells_with_enz_or_bio:
    out.write(line)
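Putting those pieces together, the whole second pass might look something like this (a sketch only; it reuses the filename and out variables from above and assumes the header line was already written during the first pass):

with open(filename) as fh:
    fh.readline()  # skip the header; it was written during the first pass
    for line in fh:
        cols = line.split(',')
        if cols[3] in cells_with_enz_or_bio:
            out.write(line)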

after all that?

Just using those suggestions might be enough to get you by. You are still storing the entire sets of cells, biochemicals and enzymes in memory, though. And you're still running through the files twice.

So there are two ways we could potentially speed it up, while still staying with a single Python process. I don't know how much memory you have available. If you are running out of memory, then that could be slowing things down.

reducing sets as we go.

If you do have a million records, and 800000 of them are pairs (have a cell record and a biochemical record) then by the time you get to the end of the list, you're storing 800000 IDs in sets. To reduce memory usage, once we've established that we do want to output a record, we could save that information (that we want to print the record) to a file on disk, and stop storing it in memory. Then we could read that list back later to figure out which records to print.

Since this does increase disk IO, it could be slower. But if you are running out of memory, it could reduce swapping, and thus end up faster. It's hard to tell.

with open('to_output.tmp', 'a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...

        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            cells.remove(row_id)
            biochemicals.discard(row_id)  # discard() doesn't raise if the id isn't in the set
            enzymes.discard(row_id)

Once you've read through all the files, you now have a file (to_output.tmp) which contains all the IDs that you want to keep. So you can read that back into Python:

with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))

which means you can then on your second run through the files simply say:

if row_id in ids_to_keep:
    out.write(line)

using dict instead of sets:

If you have plenty of memory, you could bypass all of that and use dicts for storing the data, which would let you run through the files only once and skip the sets entirely.

cells = {}
enzymes = {}
biochemicals = {}

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line

        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])

The problem with this method is that if any rows are duplicated, it will get confused.

If you are sure that the input records are unique, and that they either have enzyme or biochemical records, but not both, then you could easily add del cells[row_id] and the appropriate others to remove rows from the dicts once you've printed them, which would reduce memory usage.
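For instance, under exactly those assumptions (unique rows, and each ID having either an enzyme row or a biochemical row but not both), the matching branch could become something like:

if row_id in cells and row_id in biochemicals:
    out.write(cells[row_id])
    out.write(biochemicals[row_id])
    del cells[row_id]          # free the entries once the pair has been printed
    del biochemicals[row_id]
elif row_id in cells and row_id in enzymes:
    out.write(cells[row_id])
    out.write(enzymes[row_id])
    del cells[row_id]
    del enzymes[row_id]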

I hope this helps :-)

Upvotes: 3

skrrgwasme

Reputation: 9633

A technique I have used to deal with massive files quickly in Python is to use the multiprocessing library to split the file into large chunks, and process those chunks in parallel in worker subprocesses.

Here's the general algorithm:

  1. Based on the amount of memory you have available on the system that will run this script, decide how much of the file you can afford to read into memory at once. The goal is to make the chunks as large as you can without causing thrashing.

  2. Pass the file name and chunk beginning/end positions to subprocesses, which will each open the file, read in and process their sections of the file, and return their results.

Specifically, I like to use a multiprocessing pool, then create a list of chunk start/stop positions, then use the pool.map() function. This will block until everyone has completed, and the results from each subprocess will be available if you catch the return value from the map call.
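As a rough sketch of what that main-process side could look like (the chunk size and the worker name process_chunk are assumptions here, not part of the answer's code; the worker would wrap something like the snippet shown next), the chunk boundaries can be nudged forward so that every chunk ends on a newline:

import os
from multiprocessing import Pool

def chunk_boundaries(filename, chunk_size=64 * 1024 * 1024):
    """Yield (start, end) byte offsets, nudged forward so each chunk ends on a newline."""
    file_size = os.path.getsize(filename)
    with open(filename, 'rb') as fi:
        start = 0
        while start < file_size:
            fi.seek(min(start + chunk_size, file_size))
            fi.readline()                    # advance to the end of the current line
            end = min(fi.tell(), file_size)
            yield start, end
            start = end

if __name__ == '__main__':
    filename = "App_1_target_Detail_average.csv"    # example input file
    jobs = [(filename, start, end) for start, end in chunk_boundaries(filename)]
    pool = Pool()                                   # one worker per CPU core by default
    results = pool.map(process_chunk, jobs)         # blocks until every chunk is processed
    pool.close()
    pool.join()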

For example, you could do something like this in your subprocesses:

import operator

# assume we have passed in byte positions to start and end at, and a file name (fi_name):

with open(fi_name, 'r') as fi:
    fi.seek(chunk_start)
    chunk = fi.readlines(chunk_end - chunk_start)

retriever = operator.itemgetter(3, 17) # extracts only the elements we want
APPIDs = {}

for line in chunk:

    fields = line.split(',')            # the columns are comma separated
    ID, category = retriever(fields)
    category = category.strip()         # " Cell " -> "Cell", to match the check further down
    try:
        APPIDs[ID].append(category) # we've seen this ID before, add category to its list
    except KeyError:
        APPIDs[ID] = [category] # we haven't seen this ID before - make an entry

# APPIDs entries will look like this:
# 
# <APPID> : [list of categories]

return APPIDs

In your main process, you would retrieve all the returned dictionaries and resolve duplicates or overlaps, then output something like this:

for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
         # print or whatever
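The "retrieve all the returned dictionaries" step that produces the combined APPIDs dict used above could, for example, look like this (a sketch; results is assumed to be the list returned by pool.map):

# Merge the per-chunk dictionaries returned by the worker subprocesses.
APPIDs = {}
for chunk_dict in results:
    for ID, categories in chunk_dict.items():
        APPIDs.setdefault(ID, []).extend(categories)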

A couple of notes/caveats:

  • Pay attention to the load on your hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvements from this. You can try implementing the same algorithm with threading instead.
  • If you do get a heavy hard disk load that's NOT due to memory thrashing, you can also reduce the number of simultaneous subprocesses you're allowing in the pool. This will result in fewer read requests to the drive, while still taking advantage of truly parallel processing.
  • Look for patterns in your input data you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can actually do all of your comparisons in the subprocesses and let your main process hang out until it's time to combine the subprocess data structures (see the sketch after this list).
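For that last point, if rows with the same APPID really are adjacent (for example because the file is sorted on that column), a worker could emit the matching rows directly with itertools.groupby. A minimal sketch, assuming the column positions and padded values (" Cell ", etc.) from the question's data; emit_matching_rows is a hypothetical helper name:

import itertools

def emit_matching_rows(lines):
    """Assumes lines with the same APP_ID are adjacent; yields the rows to keep."""
    parsed = (line.split(',') for line in lines)
    for app_id, group in itertools.groupby(parsed, key=lambda cols: cols[3]):
        rows = list(group)
        categories = {cols[17].strip() for cols in rows}
        if 'Cell' in categories and ('Biochemical' in categories or 'Enzyme' in categories):
            for cols in rows:
                if cols[17].strip() in ('Cell', 'Biochemical', 'Enzyme'):
                    yield ','.join(cols)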

TL;DR

Break your file up into chunks and process them in parallel with the multiprocessing library.

Upvotes: 1
