kauii8

Reputation: 229

Using np.apply_along_axis but only on certain indices

I have a numpy array, X, of shape (200, 200, 1500). I also have a function, func, that essentially returns the mean of an array (it does a few other things, but they are all numpy operations; you can think of it as np.mean). If I want to apply this function along axis 2, I can just do np.apply_along_axis(func, 2, X). But I also have a truth array of shape (200, 200, 1500), and I want to apply func only where the truth array is True, ignoring any places where it is False. Going back to the np.mean example, it would take the mean along axis 2 for each pair of leading indices, but ignore some arbitrary set of indices.

So in practice, my solution would be to convert X into a new array Y with shape (200, 200) whose elements are lists, built using the truth array, and then apply func to each list in the array. The problem is that this seems very time consuming, and I feel like there is a numpy-oriented solution for this. Is there?

If what I said with the array list is the best way, how would I go about combining X and the truth array to get Y?

Any suggestions or comments appreciated.

Upvotes: 0

Views: 635

Answers (1)

hpaulj

Reputation: 231738

In [268]: X = np.random.randint(0,100,(200,200,1500))                                                

Let's check how apply works with just np.mean:

In [269]: res = np.apply_along_axis(np.mean, 2, X)                                                   
In [270]: res.shape                                                                                  
Out[270]: (200, 200)
In [271]: timeit res = np.apply_along_axis(np.mean, 2, X)                                            
1.2 s ± 36.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

An equivalent using iteration on the first two dimensions. I'm using reshape to make it easier to write; speed should be about the same with a double loop.

In [272]: res1 = np.reshape([np.mean(row) for row in X.reshape(-1,1500)],(200,200))                  
In [273]: np.allclose(res, res1)                                                                     
Out[273]: True
In [274]: timeit res1 = np.reshape([np.mean(row) for row in X.reshape(-1,1500)],(200,200))           
906 ms ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So apply may be convenient, but it is not a speed tool.

For speed in numpy you need to maximize the use of compiled code and avoid unnecessary Python-level loops.

In [275]: res2 = np.mean(X,axis=2)                                                                   
In [276]: np.allclose(res2,res)                                                                      
Out[276]: True
In [277]: timeit res2 = np.mean(X,axis=2)                                                            
120 ms ± 619 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

If using apply in your new case is hard, you don't lose anything by using something you do understand.

masked

In [278]: mask = np.random.randint(0,2, X.shape).astype(bool)                                        

The [272] iteration can be adapted to work with mask:

In [279]: resM1 = np.reshape([np.mean(row[m]) for row,m in zip(X.reshape(-1,1500),
     ...:     mask.reshape(-1,1500))],X.shape[:2])
In [280]: timeit resM1 = np.reshape([np.mean(row[m]) for row,m in zip(X.reshape(-1,1500),
     ...:     mask.reshape(-1,1500))],X.shape[:2])
1.43 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This might have problems if row[m] is empty. np.mean([]) produces a warning and nan value.
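One way to sidestep the empty-selection warning is to check the size of the selection before calling the function. The sketch below is only an illustration on a tiny array; the helper name mean_or_nan is hypothetical, and returning nan for an empty selection is an assumption about what you want in that case.

```python
import numpy as np

def mean_or_nan(values):
    # Hypothetical helper: return nan explicitly for an empty selection
    # instead of letting np.mean emit a RuntimeWarning.
    return np.mean(values) if values.size else np.nan

# Small stand-in for the (200, 200, 1500) arrays
X = np.arange(12).reshape(2, 2, 3)
mask = np.array([[[True, False, True],
                  [False, False, False]],   # nothing selected here
                 [[True, True, True],
                  [False, True, False]]])

n = X.shape[-1]
res = np.reshape(
    [mean_or_nan(row[m]) for row, m in zip(X.reshape(-1, n), mask.reshape(-1, n))],
    X.shape[:2],
)
```

The loop is still a Python-level loop, so this addresses the warning, not the speed.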

Applying the mask to X before any further processing loses dimensional information.

In [282]: X[mask].shape                                                                              
Out[282]: (30001416,)

apply only works with one array, so it will be awkward (though not impossible) to use it to iterate on both X and mask. A structured array with data and mask fields might do the job. But as the previous timings show, there's no speed advantage.
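To illustrate the "awkward though not impossible" route, here is a rough sketch that swaps the structured array for a simpler trick: concatenating X and the mask along the last axis so that apply_along_axis hands both to the function in one 1d slice. The helper masked_func and the small array sizes are assumptions made for the example, and every row is forced to keep at least one value.

```python
import numpy as np

def masked_func(combined, func=np.mean):
    # Hypothetical helper: the first half of the 1d slice holds the data,
    # the second half holds the mask stored as 0/1 values.
    n = combined.shape[0] // 2
    values = combined[:n]
    keep = combined[n:].astype(bool)
    return func(values[keep])

rng = np.random.default_rng(0)
X = rng.integers(0, 100, (4, 5, 6))
mask = rng.integers(0, 2, X.shape).astype(bool)
mask[..., 0] = True  # assumption: at least one kept value per row

# Glue data and mask together along the axis that apply iterates over
combined = np.concatenate([X, mask.astype(X.dtype)], axis=2)
res_apply = np.apply_along_axis(masked_func, 2, combined)

# Reference result from the plain iteration
res_loop = np.reshape(
    [np.mean(row[m]) for row, m in zip(X.reshape(-1, 6), mask.reshape(-1, 6))],
    X.shape[:2],
)
```

This doubles the memory used along that axis and still calls Python once per row, so it is a convenience at best.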

masked array

I don't usually expect masked arrays to offer speed, but in this case it helps:

In [285]: xM = np.ma.masked_array(X, ~mask)                                                          
In [286]: resMM = np.ma.mean(xM, axis=2)                                                             
In [287]: np.allclose(resM1, resMM)                                                                  
Out[287]: True
In [288]: timeit resMM = np.ma.mean(xM, axis=2)                                                      
849 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
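If func really is close to a mean, another fully compiled option is to compute masked sums and counts directly. The sketch below uses the where= argument of np.sum and np.mean (np.mean accepts it from NumPy 1.20 on, as far as I know); it was not timed in the session above, so no speed claim is made. The small shapes and the guarantee of one kept value per row are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 100, (4, 5, 6))
mask = rng.integers(0, 2, X.shape).astype(bool)
mask[..., 0] = True  # assumption: no row is completely masked out

# Sum only the selected values, then divide by how many were selected
counts = mask.sum(axis=2)
sums = np.sum(X, axis=2, where=mask)
res_sum = sums / counts

# Same idea in one call
res_where = np.mean(X, axis=2, where=mask)

# Reference: masked-array mean
res_ma = np.ma.mean(np.ma.masked_array(X, ~mask), axis=2)
```

A row with no True values would give a 0/0 division in the first form, so that case needs separate handling. This only applies when func can be expressed through reductions that accept where=.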

np.nanmean

There's a set of functions that use np.nan masking:

In [289]: Xfloat = X.astype(float)                                                                   
In [290]: Xfloat[~mask] = np.nan                                                                     
In [291]: resflt = np.nanmean(Xfloat, axis=2)                                                        
In [292]: np.allclose(resM1, resflt)                                                                 
Out[292]: True
In [293]: %%timeit 
     ...: Xfloat = X.astype(float) 
     ...: Xfloat[~mask] = np.nan 
     ...: resflt = np.nanmean(Xfloat, axis=2) 
     ...:  
     ...:                                                                                            
2.17 s ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This doesn't help :(

Upvotes: 1
