Reputation: 33
I have been a frequent lurker on Stack Overflow for some time and I tend to find very useful and clear information from here whenever I have coding questions. However, I can't really seem to find a thread that addresses my specific inquiry today.
Earlier today, I learned about vectorizing functions in Python in order to speed up computing time. I am currently trying to optimize a python program that I had written a little over a month ago. My program takes a text file containing data in the following format:
<magnitude> <dmagnitude> <exposure_number>
I then assign each column to lists mag
, dmag
, and expnum
.
What I want to do is create a 2d array of the mag
and dmag
values that share the same expnum
(having the same exposure number means that the mag
and dmag
point to the same data point).
I do this for all exposure numbers and, at the end, I take the median of the mag
and dmag
, and the standard deviation of the mag
for each of the exposure number-based arrays and combine them all into one array that I can plot.
Currently, I have the following code:
from numpy import loadtxt,array,asarray,append,std,median,empty,take
data = loadtxt(infile,usecols=(0,1,2))
mag = data1[:,2].tolist()
dmag = data1[:,3].tolist()
expnum = data1[:,4].tolist()
#initialize variables
indexing = list()
master_mag = list()
master_dmag = list()
sub_mag = list()
sub_dmag = list()
mag_std = array([])
mag_stdmed = array([])
mag_med = array([])
while len(mag) > 0:
num=expnum[0]
for i in range(0,len(expnum)):
if expnum[i] == num:
sub_mag.append(mag[i])
sub_dmag.append(dmag[i])
indexing.append(i)
#add the sub lists to their master lists
master_mag.append(sub_mag)
master_dmag.append(sub_dmag)
sub_mag=list()
sub_dmag=list()
#remove from mag, dmag, and expnum the index referred to by indexing
while len(indexing) > 0:
mag.pop(indexing[-1])
dmag.pop(indexing[-1])
expnum.pop(indexing[-1])
indexing.pop()
#make the master mag and dmag lists into numpy arrays
master_mag=asarray(master_mag)
master_dmag=asarray(master_dmag)
#generate the mag and dmag median and mag std arrays
for i in range(0,len(master_mag)):
mag_std=append(mag_std,std(master_mag[i]))
mag_med=append(mag_med,median(master_mag[i]))
mag_stdmed=append(mag_stdmed,median(master_dmag[i]))
#create empty numpy arrays to be used for mag med vs. mag std
#and mag med vs. dmag med
med_std=empty([0,2])
med_dmed=empty([0,2])
#fill in those arrays
for i in range(0,len(mag_std)):
med_std=append(med_std,[[mag_med[i],mag_std[i]]],axis=0)
med_dmed=append(med_dmed,[[mag_med[i],mag_stdmed[i]]],axis=0)
#sort the median mag and dmag standard deviation arrays by median mag
order_med_std=med_std[:,0].argsort()
order_med_dmed=med_dmed[:,0].argsort()
sorted_med_std=take(med_std,order_med_std,0)
sorted_med_dmed=take(med_dmed,order_med_dmed,0)
And then I'm ready to plot sorted_med_dmed[:,0]
vs. sorted_med_dmed[:,1]
and sorted_med_std[:,0]
vs. sorted_med_std[:,1]
This code works, it's just that I feel that it is too slow (especially when I get over 10,000 data points to work with). I want to try and vectorize this code to make it much quicker, but I have no idea where to begin.
I would like some help figuring out how to vectorize the matching-by-exposure-number component. I was thinking of creating a multi-dimensional array at the start that has the format: array([[[mag],[dmag]],...])
and a length equal to the number of different exposure numbers. Is there a way to generate and update an array like this in-line, without having to use a ton of loops?
Please let me know if you need any further clarity on what exactly this code is doing.
Thank you for your time.
Upvotes: 1
Views: 237
Reputation: 17797
First step in any problem like this should always be profiling. I recommend trying line_profiler, because it makes it easy to find hotspots visually. (You can also try Python's built-in profiler, but I find its output harder to parse.)
That should give you an idea of which parts are slowing your code down the most. Without trying it myself, I can offer a few bits of advice:
append
s, but slow for most other operations, and they don't support vectorization.numpy.append
if you can. Each call involves allocating more memory and copying, which can be quite slow in a loop.Use dictionaries to group data by keys. I find the stdlib collections.defaultdict
quite useful for grouping like this:
groups = defaultdict(list)
for a,b,key in data:
groups[key].append((a,b))
Use numpy's auto-vectorized function calls instead of calling functions in a loop. For example, this code:
#generate the mag and dmag median and mag std arrays
for i in range(0,len(master_mag)):
mag_std=append(mag_std,std(master_mag[i]))
mag_med=append(mag_med,median(master_mag[i]))
mag_stdmed=append(mag_stdmed,median(master_dmag[i]))
will be much faster when written as:
mag_std = numpy.std(master_mag, axis=0)
mag_meg = numpy.median(master_mag, axis=0)
mag_stdmed = numpy.median(master_dmag, axis=0)
Upvotes: 2