Reputation: 1135
I have a question about how to reduce the run time of my code.
The code is written in Python.
It takes a huge data set as input, processes it, does the calculations, and writes the output to an array.
Most of the calculations are quite simple, such as summations. The input file has about 100 million rows and 3 columns. The problem I am facing is the very long run time. How can I reduce it?
Here is the code I wrote.
I need to write all of the new values I calculate (from GenePair through RM_pval, with a header) to a new file; there is a small sketch of that step after the code below. Thank you so much in advance.
import math
from math import sqrt, log
import sys
sys.path.append('/tools/lib/python2.7/site-packages')
import numpy as np
import scipy
from scipy.stats import norm

fi = open('1.txt')
fo = open('2.txt', 'w')
for line in fi.xreadlines():
    tmp = line.split('\t')
    GenePair = tmp[0].strip()
    PCC_A = float(tmp[1].strip())
    PCC_B = float(tmp[2].strip())
    # Fisher z-transform of both correlations
    ZVAL_A = 0.5 * log((1 + PCC_A) / (1 - PCC_A))
    ZVAL_B = 0.5 * log((1 + PCC_B) / (1 - PCC_B))
    ABS_ZVAL_A = abs(ZVAL_A)
    ABS_ZVAL_B = abs(ZVAL_B)
    Var_A = 1.0 / (21 - 3)  # 1 / (SAMPLESIZE - 3)
    Var_B = 1.0 / (18 - 3)  # 1 / (SAMPLESIZE - 3)
    WT_A = 1 / Var_A
    WT_B = 1 / Var_B
    ZVAL_A_X_WT_A = ZVAL_A * WT_A
    ZVAL_B_X_WT_B = ZVAL_B * WT_B
    SumofWT = WT_A + WT_B
    SumofZVAL_X_WT = ZVAL_A_X_WT_A + ZVAL_B_X_WT_B

    # FIXED MODEL
    meanES = SumofZVAL_X_WT / SumofWT
    Var = 1.0 / SumofWT
    SE = sqrt(Var)
    LL = meanES - (1.96 * SE)  # lower limit of the 95% CI
    UL = meanES + (1.96 * SE)  # upper limit of the 95% CI
    z_score = meanES / SE
    p_val = norm.sf(z_score)

    # CAL
    ES_POWER_X_WT_A = pow(ZVAL_A, 2) * WT_A
    ES_POWER_X_WT_B = pow(ZVAL_B, 2) * WT_B
    WT_POWER_A = pow(WT_A, 2)
    WT_POWER_B = pow(WT_B, 2)
    SumofES_POWER_X_WT = ES_POWER_X_WT_A + ES_POWER_X_WT_B
    SumofWT_POWER = WT_POWER_A + WT_POWER_B

    # COMPUTE TAU
    tmp_A = ZVAL_A - meanES
    tmp_B = ZVAL_B - meanES
    temp = pow(SumofZVAL_X_WT, 2)
    Q = SumofES_POWER_X_WT - (temp / SumofWT)
    if PCC_A != 0 or PCC_B != 0:
        df = 0
    else:
        df = 1
    c = SumofWT - (SumofWT_POWER / SumofWT)  # C = sum(W) - sum(W^2) / sum(W)
    if c == 0:
        tau_square = 0
    else:
        tau_square = (Q - df) / c

    # calculation
    Var_total_A = Var_A + tau_square
    Var_total_B = Var_B + tau_square
    WT_total_A = 1.0 / Var_total_A
    WT_total_B = 1.0 / Var_total_B
    ZVAL_X_WT_total_A = ZVAL_A * WT_total_A
    ZVAL_X_WT_total_B = ZVAL_B * WT_total_B
    Sumoftotal_WT = WT_total_A + WT_total_B
    Sumoftotal_ZVAL_X_WT = ZVAL_X_WT_total_A + ZVAL_X_WT_total_B

    # RANDOM MODEL
    RM_meanES = Sumoftotal_ZVAL_X_WT / Sumoftotal_WT
    RM_Var = 1.0 / Sumoftotal_WT
    RM_SE = sqrt(RM_Var)
    RM_LL = RM_meanES - (1.96 * RM_SE)
    RM_UL = RM_meanES + (1.96 * RM_SE)
    RM_z_score = RM_meanES / RM_SE  # z = mean effect size / standard error
    RM_p_val = norm.sf(RM_z_score)
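(The write step is not in the code above; this is only a rough sketch of what I mean, with the column list abbreviated. The real header runs from GenePair through RM_pval.)

# Sketch of the output step: header once before the loop, one formatted row per input line.
fo.write('GenePair\tmeanES\tp_val\tRM_meanES\tRM_p_val\n')
for line in fi:
    # ... all of the calculations above ...
    fo.write('%s\t%g\t%g\t%g\t%g\n' % (GenePair, meanES, p_val, RM_meanES, RM_p_val))
fi.close()
fo.close()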
Upvotes: 1
Views: 2600
Reputation: 7806
Definitely do the profiler thing, but... I think the only major speedup will come from parallelism. Taking advantage of multiple cores is of paramount importance if you are going to run CPU-bound problems like this. Try putting each line through a different thread/process. This raises more questions, of course: for example, does the output need to be in the same order as the input file? If so, just enumerate the input and pass a second argument to big_hairy_func saying which line it is (there is a short sketch of that at the end of this answer).
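For the profiler part, a quick way to see where the time goes (this is just the stdlib cProfile/pstats; 'yourscript.py' and 'profile.out' are placeholder names):

# Run once from the shell to profile the whole script:
#   python -m cProfile -o profile.out yourscript.py
# Then inspect the saved profile:
import pstats
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # top 20 calls by cumulative time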
Here is some boilerplate code to get started.
Notes: xreadlines is deprecated, even though it handles large files; for line in file: replaces it.
import math
from math import sqrt, log
import multiprocessing as mp
import sys
sys.path.append('/tools/lib/python2.7/site-packages')
import numpy as np
import scipy
from scipy.stats import norm

fi = open('1.txt')
fo = open('2.txt', 'w')

def big_hairy_func(linefromfile):
    <majority of your post here>
    return <whatever data you were going to write to 'fo'>

if __name__ == '__main__':
    pool = mp.Pool(4)  # rule of thumb: replace '4' with the number of cores on your system
    result = pool.map(big_hairy_func, (input for input in fi.readlines()))
    <write the result to fo that you haven't posted>
xreadlines was deprecated in Python 2.3, so with that version I'm not sure whether the generator function will work. Let me know if you have questions about compatibility with your version of Python.
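If the output order does matter, here is a rough sketch of the enumerate idea mentioned above (big_hairy_func is still just a placeholder for your per-line math; nothing here is specific to your data):

import multiprocessing as mp

def big_hairy_func(numbered_line):
    index, line = numbered_line
    # <majority of your post here, producing the output string for this line>
    result = line  # placeholder
    return index, result

if __name__ == '__main__':
    pool = mp.Pool(4)
    with open('1.txt') as fi, open('2.txt', 'w') as fo:
        # enumerate() tags every line with its position; sorting the results
        # by that index puts the output back in input order.
        for index, result in sorted(pool.map(big_hairy_func, enumerate(fi))):
            fo.write(result)

Strictly speaking, pool.map already returns results in the same order as its input, so the explicit index mostly matters if you switch to something like imap_unordered to reduce memory use.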
Upvotes: 2