Iggy
Iggy

Reputation: 33

Vectorizing a Multi-Dimensional Function in Python

I have been a frequent lurker on Stack Overflow for some time and I tend to find very useful and clear information from here whenever I have coding questions. However, I can't really seem to find a thread that addresses my specific inquiry today.

Earlier today, I learned about vectorizing functions in Python in order to speed up computing time. I am currently trying to optimize a python program that I had written a little over a month ago. My program takes a text file containing data in the following format:

<magnitude> <dmagnitude> <exposure_number>

I then assign each column to lists mag, dmag, and expnum.

What I want to do is create a 2d array of the mag and dmag values that share the same expnum (having the same exposure number means that the mag and dmag point to the same data point).

I do this for all exposure numbers and, at the end, I take the median of the mag and dmag, and the standard deviation of the mag for each of the exposure number-based arrays and combine them all into one array that I can plot.

Currently, I have the following code:

from numpy import loadtxt,array,asarray,append,std,median,empty,take

data = loadtxt(infile,usecols=(0,1,2))
mag = data1[:,2].tolist() 
dmag = data1[:,3].tolist() 
expnum = data1[:,4].tolist() 

#initialize variables
indexing = list() 
master_mag = list() 
master_dmag = list() 
sub_mag = list() 
sub_dmag = list() 

mag_std = array([]) 
mag_stdmed = array([]) 
mag_med = array([])  

while len(mag) > 0: 
    num=expnum[0] 
    for i in range(0,len(expnum)): 
        if expnum[i] == num: 
            sub_mag.append(mag[i]) 
            sub_dmag.append(dmag[i]) 
            indexing.append(i) 

    #add the sub lists to their master lists
    master_mag.append(sub_mag) 
    master_dmag.append(sub_dmag) 
    sub_mag=list() 
    sub_dmag=list()

    #remove from mag, dmag, and expnum the index referred to by indexing
    while len(indexing) > 0:    
        mag.pop(indexing[-1]) 
        dmag.pop(indexing[-1]) 
        expnum.pop(indexing[-1]) 
        indexing.pop() 

#make the master mag and dmag lists into numpy arrays 
master_mag=asarray(master_mag) 
master_dmag=asarray(master_dmag) 

#generate the mag and dmag median and mag std arrays 
for i in range(0,len(master_mag)): 
    mag_std=append(mag_std,std(master_mag[i])) 
    mag_med=append(mag_med,median(master_mag[i])) 
    mag_stdmed=append(mag_stdmed,median(master_dmag[i])) 

#create empty numpy arrays to be used for mag med vs. mag std 
#and mag med vs. dmag med 
med_std=empty([0,2]) 
med_dmed=empty([0,2]) 

#fill in those arrays 
for i in range(0,len(mag_std)): 
    med_std=append(med_std,[[mag_med[i],mag_std[i]]],axis=0) 
    med_dmed=append(med_dmed,[[mag_med[i],mag_stdmed[i]]],axis=0) 

#sort the median mag and dmag standard deviation arrays by median mag 
order_med_std=med_std[:,0].argsort() 
order_med_dmed=med_dmed[:,0].argsort() 

sorted_med_std=take(med_std,order_med_std,0) 
sorted_med_dmed=take(med_dmed,order_med_dmed,0) 

And then I'm ready to plot sorted_med_dmed[:,0] vs. sorted_med_dmed[:,1] and sorted_med_std[:,0] vs. sorted_med_std[:,1]

This code works, it's just that I feel that it is too slow (especially when I get over 10,000 data points to work with). I want to try and vectorize this code to make it much quicker, but I have no idea where to begin.

I would like some help figuring out how to vectorize the matching-by-exposure-number component. I was thinking of creating a multi-dimensional array at the start that has the format: array([[[mag],[dmag]],...]) and a length equal to the number of different exposure numbers. Is there a way to generate and update an array like this in-line, without having to use a ton of loops?

Please let me know if you need any further clarity on what exactly this code is doing.

Thank you for your time.

Upvotes: 1

Views: 237

Answers (1)

perimosocordiae
perimosocordiae

Reputation: 17797

First step in any problem like this should always be profiling. I recommend trying line_profiler, because it makes it easy to find hotspots visually. (You can also try Python's built-in profiler, but I find its output harder to parse.)

That should give you an idea of which parts are slowing your code down the most. Without trying it myself, I can offer a few bits of advice:

  • Try to avoid using Python lists when numpy arrays will suffice. Lists are fast if you're doing lots of appends, but slow for most other operations, and they don't support vectorization.
  • Relatedly, try to avoid calling numpy.append if you can. Each call involves allocating more memory and copying, which can be quite slow in a loop.
  • Use dictionaries to group data by keys. I find the stdlib collections.defaultdict quite useful for grouping like this:

    groups = defaultdict(list)
    for a,b,key in data:
      groups[key].append((a,b))
    
  • Use numpy's auto-vectorized function calls instead of calling functions in a loop. For example, this code:

    #generate the mag and dmag median and mag std arrays 
    for i in range(0,len(master_mag)): 
      mag_std=append(mag_std,std(master_mag[i])) 
      mag_med=append(mag_med,median(master_mag[i])) 
      mag_stdmed=append(mag_stdmed,median(master_dmag[i]))
    

    will be much faster when written as:

    mag_std = numpy.std(master_mag, axis=0)
    mag_meg = numpy.median(master_mag, axis=0)
    mag_stdmed = numpy.median(master_dmag, axis=0)
    

Upvotes: 2

Related Questions