user1995839

Reputation: 767

Python and MySQL performance: Writing a massive number of SQL query results to file

I have a file containing a dictionary of values on each line. For each line, I use each key in the dictionary to query a MySQL database; the results of each query get placed in a dict, and once all values for the query dict have been generated, the line gets written out.

IN > foo bar someotherinfo {1: 'query_val', 2: 'query_val', 3: 'query_val'}
OUT > foo bar someotherinfo 1_result 2_result 3_result

This whole process appears to be somewhat slow because I'm performing around 200,000 MySQL queries per file, with around 10 files per sample and around 30 samples in total, so I'm looking to speed up the whole process.

I'm just wondering if the file I/O could be creating a bottleneck. Instead of writing the line info (foo, bar, someotherinfo) followed by each result dict as it's returned, would I be better off chunking these results in memory before writing them to file in batches?
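Something like the sketch below is what I have in mind; write_batched and the batch size of 1000 are just placeholders for illustration:

def write_batched(lines, out_path, batch_size=1000):
    # buffer formatted output lines in memory and flush them in chunks
    buffer = []
    out_file = open(out_path, 'w')
    for line in lines:
        buffer.append(line)
        if len(buffer) >= batch_size:
            out_file.writelines(buffer)
            buffer = []
    if buffer:  # flush whatever is left over
        out_file.writelines(buffer)
    out_file.close()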

Or is this simply a case of having to just wait it out... ?

Example Input line and output line
INPUT
 XM_006557349.1  1       -       exon    XM_006557349.1_exon_2   10316   10534   {1: '10509:10534', 2: '10488:10508', 3: '10467:10487', 4: '10446:10466', 5: '10425:10445', 6: '10404:10424', 7: '10383:10403', 8: '10362:10382', 9: '10341:10361', 10: '10316:10340'}
OUTPUT
 XM_006557349.1  1       -       exon    XM_006557349.1_exon_2   10316   10534   0.7083  0.2945  0.2     0.2931  0.125   0.1154  0.2095  0.5833  0.0569  0.0508


CODE
import ast  # needed for ast.literal_eval below

def array_2_meth(sample, bin_type, type, cur_meth):
    bins_in = open('bin_dicts/' + bin_type, 'r')
    meth_out = open('meth_data/' + bin_type + '_' + sample + '_plus_' + type + '_meth.tsv', 'w')
    for line in bins_in:
        meth_dict = {}
        # build array of data from each line
        array = line.strip('\n').split('\t')
        mrna_id = array[0]
        assembly = array[1]
        strand = array[2]
        bin_dict = ast.literal_eval(array[7])
        # one query per bin: this is the part that runs ~200,000 times per file
        for bin in bin_dict:
            coords = bin_dict[bin].split(':')
            start = int(coords[0]) - 1
            end = int(coords[1]) + 1
            cur_meth.execute('select sum(mc)/sum(h) from allc_' + str(sample) + '_' + str(assembly) +
                             " where strand = '" + str(strand) + "' and class = '" + str(type) + "'" +
                             ' and position between ' + str(start) + ' and ' + str(end) + ' and h >= 5')
            for row in cur_meth.fetchall():
                if row[0] is None:  # sum(mc)/sum(h) is NULL when no rows pass the h >= 5 filter
                    meth_dict[bin] = 'no_cov'
                else:
                    meth_dict[bin] = float(row[0])
        meth_out.write('\t'.join(array[:7]))
        for k in sorted(meth_dict.keys()):
            meth_out.write('\t' + str(meth_dict[k]))
        meth_out.write('\n')
    bins_in.close()
    meth_out.close()

Not sure if adding this code is going to be a massive help, but it should show the way I'm approaching this. Any advice you could provide on mistakes in my approach, or tips on how to optimise, would be greatly appreciated!

Thanks ^_^

Upvotes: 0

Views: 185

Answers (1)

Jacky Cheng

Reputation: 1556

I don't think the file I/O should take too long; the main bottleneck is probably the number of queries you are making. But from the example you provided I see no pattern in those start and end positions, so I have no idea how to cut down the number of queries.

I have an idea that is probably either amazing or stupid, depending on your test results. (Also, I don't know much about Python, so please forgive the syntax, haha.)

It SEEMS that every query returns only a single value? Maybe you could try something like:

queries = []
for bin in bin_dict:
    coords = bin_dict[bin].split(':')
    start = int(coords[0]) - 1
    end = int(coords[1]) + 1
    queries.append('select sum(mc)/sum(h) from allc_' + str(sample) + '_' + str(assembly) +
                   " where strand = '" + str(strand) + "' and class = '" + str(type) + "'" +
                   ' and position between ' + str(start) + ' and ' + str(end) + ' and h >= 5')

# joining the pieces with UNION ALL avoids having to strip a trailing keyword
cur_meth.execute(' union all '.join(queries))
for row in cur_meth.fetchall():
    # loop through the rows (one per bin) and write them to file
    pass

The core idea is to use UNION ALL to join all the queries into one, so you only need one round trip to the database instead of the 10 shown in your example. You also reduce the 10 file writes to one. The possible drawback is that UNION ALL might be slow, but as far as I know it shouldn't take any more processing time than 10 individual queries, as long as you keep the SQL in the format shown in my example.
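One caveat: strictly speaking, SQL doesn't guarantee the row order of a UNION ALL, so if you want to be safe you could tag each subquery with its bin number and match the rows back explicitly. A sketch of that, under the same assumptions as above:

queries = []
for bin in bin_dict:
    coords = bin_dict[bin].split(':')
    start = int(coords[0]) - 1
    end = int(coords[1]) + 1
    # select the bin id alongside the value so each row identifies its bin
    queries.append('select ' + str(bin) + ' as bin_id, sum(mc)/sum(h)' +
                   ' from allc_' + str(sample) + '_' + str(assembly) +
                   " where strand = '" + str(strand) + "' and class = '" + str(type) + "'" +
                   ' and position between ' + str(start) + ' and ' + str(end) + ' and h >= 5')

cur_meth.execute(' union all '.join(queries))
results = dict(cur_meth.fetchall())  # maps bin_id -> sum(mc)/sum(h), or None for no coverage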

The second obvious method is to run it in parallel. If you are not using all of your machine's processing power, you could try starting multiple scripts or programs at the same time, since all you do is query data without modifying anything. This would make each individual script slightly slower but the overall run faster, as it should reduce the wait time between queries.
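A rough sketch of that idea, using one worker process per input file; the MySQLdb driver, the connection parameters, and the job list are all placeholders:

from multiprocessing import Pool
import MySQLdb  # assuming a MySQLdb/PyMySQL-style driver

def process_file(args):
    # each worker opens its own connection: MySQL connections and cursors
    # should not be shared between processes
    sample, bin_type, type_ = args
    conn = MySQLdb.connect(host='localhost', user='user',
                           passwd='secret', db='meth')  # placeholder credentials
    try:
        array_2_meth(sample, bin_type, type_, conn.cursor())  # the function from the question
    finally:
        conn.close()

if __name__ == '__main__':
    # one (sample, bin_type, type) tuple per input file; names are made up
    jobs = [('sample1', 'genes', 'CG'), ('sample1', 'genes', 'CHG')]
    Pool(processes=4).map(process_file, jobs)  # 4 workers is an arbitrary choice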

Upvotes: 1
