Chubaka

Reputation: 3155

Improve efficiency in Python matching

Given the following input, we would like to keep rows that share the same "APP_ID" column (the 4th column) and whose "Category" column (the 18th column) values are one "Cell" and one "Biochemical", or one "Cell" and one "Enzyme".

A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-2 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,

The ideal output will be

A , APPID , C , APP_ID , D , E , F , G , H , I , J , K , L , M , O , P , Q , Category , S , T
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,

"APP-1" is kept because their column 3 are the same and their Category are one "Cell" and the other one is "Enzyme". The same thing for "APP-3", which has one "Cell" and the other one is "Biochemical" in its "Category" column.

The following attempt could do the trick:

import os

App=["1"]

for a in App:
    outname="App_"+a+"_target_overlap.csv"
    out=open(outname,'w')
    ticker=0
    cell_comp_id=[]
    final_comp_id=[]

    # make compound with cell activity (to a target) list first

    filename="App_"+a+"_target_Detail_average.csv"
    if os.path.exists(filename):
        file = open (filename)
        line=file.readlines()
        if(ticker==0): # Deal with the title
            out.write(line[0])
            ticker=ticker+1

            for c in line[1:]:
                c=c.split(',')
                if(c[17]==" Cell "):
                     cell_comp_id.append(c[3])
    else:
        cell_comp_id=list(set(cell_comp_id))

# Now that we have the list of compounds with cell activity, search for the Biochemical and Enzyme rows and build the final compound list

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for comp in cell_comp_id:
                if (c[3]==comp and c[17]==" Biochemical "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                elif (c[3]==comp and c[17]==" Enzyme "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
    else:
        final_comp_id=list(set(final_comp_id))

# After we obtain the final compound list for target a, go through the whole csv again to output the cell rows

    filename="App_"+a+"_target_Detail_average.csv"

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for final in final_comp_id:
                if (c[3]==final and c[17]==" Cell "):
                    out.write(str(temporary_line))

    out.close()

When the input file is small (tens of thousands of lines), this script finishes in reasonable time. However, when the input files grow to millions or billions of lines, the script takes forever (days...). I think the issue is that we first build a list of APP_IDs whose 18th column is "Cell", then compare that "Cell" list (maybe half a million entries) against the whole file (one million lines, for example): if an APP_ID appears both in the Cell list and in a row of the whole file whose 18th column is "Enzyme" or "Biochemical", we keep that row. This step seems to be very time consuming.

I am thinking that building "Cell", "Enzyme" and "Biochemical" dictionaries and comparing those might be faster? Does anyone have a better way to process this? Any example/comment would be helpful. Thanks.

Upvotes: 0

Views: 94

Answers (2)

Daniel

Reputation: 1420

reading the file(s) efficiently

One big problem is that you're reading the whole file at once using readlines. This requires loading it ALL into memory in one go. I doubt you have that much memory available.

Try:

with open(filename) as fh:
    out.write(fh.readline()) # ticker
    for line in fh: #iterate through lines 'lazily', reading as you go.
        c = line.split(',')

style of code to start with. This should help a lot. Here it is in context:

# make compound with cell activity (to a target) list first

if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline()) # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])

The with open(...) as syntax is a very common Python idiom which automatically handles closing the file when you leave the with block, or if there is an error. Very useful.
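For comparison, here is roughly what that with block does for you behind the scenes (just a sketch; the loop body is a placeholder):

fh = open(filename)
try:
    for line in fh:
        pass  # process each line here
finally:
    fh.close()  # always runs, even if an exception was raised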

sets

Next thing is, as you suggest, using sets a little better.

You don't need to recreate the set each time; you can just update it to add items. Here's some example set code (written in the Python interpreter style; >>> at the beginning means it's a line of stuff to type - don't actually type the >>> bit!):

>>> my_set = set()
>>> my_set
set()

>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])

>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])

>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])

>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])

So you can add items to a set, and remove them, without creating a new set from scratch (which you are doing each time with the cell_comp_id=list(set(cell_comp_id)) bit).

You can also get differences, intersections, etc:

>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])

>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])

See the docs for more info.

So let's try something like:

cells = set()
enzymes = set()
biochemicals = set()

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)

Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:

cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals

You can then go through all the files again and simply check whether row_id (or c[3]) is in either of those sets, and if so, print the line.

You can actually combine those two sets even further:

cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals

which gives the IDs that have a Cell row and either an Enzyme or a Biochemical row.

So then when you run through the files the second time, you can do:

if row_id in cells_with_enz_or_bio:
    out.write(line)
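Putting those pieces together, the whole second pass might look something like this (a sketch only; it reuses the filename and out variables from above and assumes the header line was already written during the first pass):

with open(filename) as fh:
    fh.readline()  # skip the header; it was written during the first pass
    for line in fh:
        cols = line.split(',')
        if cols[3] in cells_with_enz_or_bio:
            out.write(line)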

after all that?

Just using those suggestions might be enough to get you by. You are still storing the entire sets of cells, biochemicals and enzymes in memory, though. And you're still running through the files twice.

So there are two ways we could potentially speed it up, while still staying with a single Python process. I don't know how much memory you have available. If you are running out of memory, then that could be slowing things down.

reducing sets as we go.

If you do have a million records, and 800000 of them are pairs (have a cell record and a biochemical record) then by the time you get to the end of the list, you're storing 800000 IDs in sets. To reduce memory usage, once we've established that we do want to output a record, we could save that information (that we want to print the record) to a file on disk, and stop storing it in memory. Then we could read that list back later to figure out which records to print.

Since this does increase disk IO, it could be slower. But if you are running out of memory, it could reduce swapping, and thus end up faster. It's hard to tell.

with open('to_output.tmp', 'a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...

        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            cells.remove(row_id)
            biochemicals.discard(row_id)  # discard() doesn't raise if the id isn't in the set
            enzymes.discard(row_id)

Once you've read through all the files, you now have a file (to_output.tmp) which contains all the IDs that you want to keep. So you can read that back into Python:

with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))

which means you can then on your second run through the files simply say:

if row_id in ids_to_keep:
    out.write(line)

using dict instead of sets:

If you have plenty of memory, you could bypass all of that and use dicts for storing the data, which would let you run through the files only once and skip the sets entirely.

cells = {}
enzymes = {}
biochemicals = {}

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line

        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])

The problem with this method is that if any rows are duplicated, it will get confused.

If you are sure that the input records are unique, and that they either have enzyme or biochemical records, but not both, then you could easily add del cells[row_id] and the appropriate others to remove rows from the dicts once you've printed them, which would reduce memory usage.
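For instance, under exactly those assumptions (unique rows, and each ID having either an enzyme row or a biochemical row but not both), the matching branch could become something like:

if row_id in cells and row_id in biochemicals:
    out.write(cells[row_id])
    out.write(biochemicals[row_id])
    del cells[row_id]          # free the entries once the pair has been printed
    del biochemicals[row_id]
elif row_id in cells and row_id in enzymes:
    out.write(cells[row_id])
    out.write(enzymes[row_id])
    del cells[row_id]
    del enzymes[row_id]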

I hope this helps :-)

Upvotes: 3

skrrgwasme

Reputation: 9633

A technique I have used to deal with massive files quickly in Python is to use the multiprocessing library to split the file into large chunks, and process those chunks in parallel in worker subprocesses.

Here's the general algorithm:

  1. Based on the amount of memory you have available on the system that will run this script, decide how much of the file you can afford to read into memory at once. The goal is to make the chunks as large as you can without causing thrashing.

  2. Pass the file name and chunk beginning/end positions to subprocesses, which will each open the file, read in and process their sections of the file, and return their results.

Specifically, I like to use a multiprocessing pool, then create a list of chunk start/stop positions, then use the pool.map() function. This will block until everyone has completed, and the results from each subprocess will be available if you catch the return value from the map call.
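As a rough sketch of what that main-process side could look like (the chunk size and the worker name process_chunk are assumptions here, not part of the answer's code; the worker would wrap something like the snippet shown next), the chunk boundaries can be nudged forward so that every chunk ends on a newline:

import os
from multiprocessing import Pool

def chunk_boundaries(filename, chunk_size=64 * 1024 * 1024):
    """Yield (start, end) byte offsets, nudged forward so each chunk ends on a newline."""
    file_size = os.path.getsize(filename)
    with open(filename, 'rb') as fi:
        start = 0
        while start < file_size:
            fi.seek(min(start + chunk_size, file_size))
            fi.readline()                    # advance to the end of the current line
            end = min(fi.tell(), file_size)
            yield start, end
            start = end

if __name__ == '__main__':
    filename = "App_1_target_Detail_average.csv"    # example input file
    jobs = [(filename, start, end) for start, end in chunk_boundaries(filename)]
    pool = Pool()                                   # one worker per CPU core by default
    results = pool.map(process_chunk, jobs)         # blocks until every chunk is processed
    pool.close()
    pool.join()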

For example, you could do something like this in your subprocesses:

import operator

# assume we have passed in byte positions to start and end at, and a file name (fi_name):

with open(fi_name, 'r') as fi:
    fi.seek(chunk_start)
    chunk = fi.readlines(chunk_end - chunk_start)

retriever = operator.itemgetter(3, 17) # extracts only the elements we want
APPIDs = {}

for line in chunk:

    fields = line.split(',')            # the columns are comma separated
    ID, category = retriever(fields)
    category = category.strip()         # " Cell " -> "Cell", to match the check further down
    try:
        APPIDs[ID].append(category) # we've seen this ID before, add category to its list
    except KeyError:
        APPIDs[ID] = [category] # we haven't seen this ID before - make an entry

# APPIDs entries will look like this:
# 
# <APPID> : [list of categories]

return APPIDs

In your main process, you would retrieve all the returned dictionaries and resolve duplicates or overlaps, then output something like this:

for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
         # print or whatever
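The "retrieve all the returned dictionaries" step that produces the combined APPIDs dict used above could, for example, look like this (a sketch; results is assumed to be the list returned by pool.map):

# Merge the per-chunk dictionaries returned by the worker subprocesses.
APPIDs = {}
for chunk_dict in results:
    for ID, categories in chunk_dict.items():
        APPIDs.setdefault(ID, []).extend(categories)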

A couple of notes/caveats:

  • Pay attention to the load on your hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvements from this. You can try implementing the same algorithm with threading instead.
  • If you do get a heavy hard disk load that's NOT due to memory thrashing, you can also reduce the number of simultaneous subprocesses you're allowing in the pool. This will result in fewer read requests to the drive, while still taking advantage of truly parallel processing.
  • Look for patterns in your input data you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can actually do all of your comparisons in the subprocesses and let your main process hang out until it's time to combine the subprocess data structures (see the sketch after this list).
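For that last point, if rows with the same APPID really are adjacent (for example because the file is sorted on that column), a worker could emit the matching rows directly with itertools.groupby. A minimal sketch, assuming the column positions and padded values (" Cell ", etc.) from the question's data; emit_matching_rows is a hypothetical helper name:

import itertools

def emit_matching_rows(lines):
    """Assumes lines with the same APP_ID are adjacent; yields the rows to keep."""
    parsed = (line.split(',') for line in lines)
    for app_id, group in itertools.groupby(parsed, key=lambda cols: cols[3]):
        rows = list(group)
        categories = {cols[17].strip() for cols in rows}
        if 'Cell' in categories and ('Biochemical' in categories or 'Enzyme' in categories):
            for cols in rows:
                if cols[17].strip() in ('Cell', 'Biochemical', 'Enzyme'):
                    yield ','.join(cols)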

TL;DR

Break your file up into chunks and process them in parallel with the multiprocessing library.

Upvotes: 1
