adir abargil

Reputation: 5745

numpy/pandas vectorize custom for loop

I created some example code that mimics the code I have:

import numpy as np 

arr = np.random.random(100)
arr2 = np.linspace(0, 1, 20)
arr3 = np.zeros(20) # this is the array i want to store the result in
for index, num in enumerate(list(arr2)):
    arr3[index] = np.mean(arr[np.abs(num - arr) < 0.2])

>>> arr3
array([0.10970893, 0.1132479 , 0.14687451, 0.17257954, 0.19401919,
       0.23852137, 0.29151448, 0.35715096, 0.43273118, 0.45800796,
       0.52940421, 0.60345354, 0.63969432, 0.67656363, 0.72921913,
       0.78330793, 0.82693675, 0.83717402, 0.86651827, 0.89782569])

My problem is that this code runs on much larger data. I want to know whether some combination of numpy or pandas can do this in a vectorized way, without explicit loops. I have tried many approaches but haven't found anything that works.

Upvotes: 2

Views: 245

Answers (3)

Mad Physicist

Reputation: 114488

I would recommend a completely different approach if you are dealing with large arrays. Right now, you are searching through the entirety of arr for every element in arr2. This is clearly overkill. Instead, you can operate on a sorted arr and simply sum between the insertion points obtained from np.searchsorted.

Sort arr in place if you can:

arr.sort()

You know the width of your interval, so find the bounding values. I'm making the array shaped (20, 2) to match the bounds more easily:

bounds = arr2.reshape(-1, 1) + [-0.2, 0.2]

Now find the insertion indices:

ind = np.searchsorted(arr, bounds)

ind is the same shape as bounds. ind[i, :] are the start (inclusive) and end (exclusive) indices into arr that correspond to the ith element of arr2. In other words, for any given i, arr3[i] in the original question is arr[ind[i, 0]:ind[i, 1]].mean(). You can use this directly for a non-vectorized solution:

result = np.array([arr[slice(*i)].mean() for i in ind])

There are a couple of ways to vectorize the solution. In either case, you'll want the number of elements in each run:

n = np.diff(ind, axis=1).ravel()

A quick and dirty solution that is prone to rounding errors uses np.cumsum and fancy indexing using ind:

cumulative = np.r_[0, np.cumsum(arr)]
sums = np.diff(cumulative[ind], axis=1).ravel()
result = sums / n

A more robust solution would extract only the sums that you actually need using np.add.reduceat:

arr = np.r_[arr, 0]  # compensate for index past the end
sums = np.add.reduceat(arr, ind.ravel())[::2]
result = sums / n

You can compare the results of both approaches to arr3 computed in the question to verify that the second approach is noticeably more accurate, even with your toy example.
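As a self-contained sketch of that comparison (the seed and the result variable names here are mine, not from the answer):

import numpy as np

np.random.seed(0)  # hypothetical seed, for reproducibility
arr = np.random.random(100)
arr2 = np.linspace(0, 1, 20)

# Loop-based reference from the question.
arr3 = np.array([arr[np.abs(num - arr) < 0.2].mean() for num in arr2])

arr.sort()
ind = np.searchsorted(arr, arr2.reshape(-1, 1) + [-0.2, 0.2])
n = np.diff(ind, axis=1).ravel()

# Quick-and-dirty cumulative sums.
cumulative = np.r_[0, np.cumsum(arr)]
res_cumsum = np.diff(cumulative[ind], axis=1).ravel() / n

# np.add.reduceat on a padded copy.
padded = np.r_[arr, 0]  # guard element for an end index equal to len(arr)
res_reduceat = np.add.reduceat(padded, ind.ravel())[::2] / n

print(np.abs(res_cumsum - arr3).max())    # larger rounding error
print(np.abs(res_reduceat - arr3).max())  # noticeably smaller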

Timing

def original(arr, arr2, d):
    arr3 = np.empty_like(arr2)
    for index, num in enumerate(arr2):
        arr3[index] = np.mean(arr[np.abs(num - arr) < d])
    return arr3

def ananda(arr, arr2, d):
    arr_tile = np.tile(arr, (len(arr2), 1))
    arr_tile[np.abs(arr - arr2[:, None]) >= d] = np.nan
    return np.nanmean(arr_tile, axis=1)

def mad_0(arr, arr2, d):
    arr.sort()
    ind = np.searchsorted(arr, arr2.reshape(-1, 1) + [-d, d])
    return np.array([arr[slice(*i)].mean() for i in ind])

def mad_1(arr, arr2, d):
    arr.sort()
    ind = np.searchsorted(arr, arr2.reshape(-1, 1) + [-d, d])
    n = np.diff(ind, axis=1).ravel()
    sums = np.diff(np.r_[0, np.cumsum(arr)][ind], axis=1).ravel()
    return sums / n

def mad_2(arr, arr2, d):
    arr.sort()
    ind = np.searchsorted(arr, arr2.reshape(-1, 1) + [-d, d])
    n = np.diff(ind, axis=1).ravel()
    arr = np.r_[arr, 0]
    sums = np.add.reduceat(arr, ind.ravel())[::2]
    return sums / n

Inputs (reset for each run):

np.random.seed(42)
arr = np.random.rand(100)
arr2 = np.linspace(0, 1, 1000)

Results:

original: 25.5 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  ananda: 2.66 ms ± 35.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   mad_0: 14.5 ms ± 48.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
   mad_1:  211 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
   mad_2:  242 µs ± 1.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For 100 elements with 1k bins, the original approach is ~10x slower than using np.tile, and the list comprehension (mad_0) is only ~2x faster than the original. While the np.cumsum approach appears to be a little faster than np.add.reduceat, it is potentially not as numerically stable.

Another advantage of using the methods I suggest is that you can change arr2 arbitrarily, while arr only needs to be sorted once.
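A sketch of that reuse pattern (the helper name window_means is my own, not from the answer):

import numpy as np

def window_means(sorted_arr, centers, d):
    # Mean of sorted_arr within +/- d of each center, via np.searchsorted.
    ind = np.searchsorted(sorted_arr, centers.reshape(-1, 1) + [-d, d])
    n = np.diff(ind, axis=1).ravel()
    padded = np.r_[sorted_arr, 0]  # guard for an end index equal to len(sorted_arr)
    sums = np.add.reduceat(padded, ind.ravel())[::2]
    return sums / n

arr = np.sort(np.random.random(100_000))  # sort once...
for k in (10, 100, 1000):                 # ...then query with any set of centers
    print(window_means(arr, np.linspace(0, 1, k), 0.2)[:3])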

Upvotes: 4

Alain T.

Reputation: 42143

Unless arr2 is larger than arr, you will get very little performance improvement (if any) by vectorizing that loop. All vectorized approaches need to broadcast arr against arr2, producing an NxM intermediate whose size offsets the benefits.
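To make that cost concrete, a back-of-the-envelope estimate (the sizes here are hypothetical):

import numpy as np

N, M = 1_000_000, 10_000  # hypothetical sizes for arr and arr2
temp_bytes = N * M * np.dtype(np.float64).itemsize
print(f"{temp_bytes / 1e9:.0f} GB")  # 80 GB for a single N x M float64 temporary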

I measured various approaches:

import numpy as np

def original(arr,arr2):
    arr3 = np.zeros(arr2.size) # this is the array i want to store the result in
    for index, num in enumerate(list(arr2)):
        arr3[index] = np.mean(arr[np.abs(num - arr) < 0.2])
    return arr3
        
def vectorized1(arr,arr2):
    delta  = (np.abs(arr2[:,None]-arr)<0.2).astype(int)  # np.int was removed from NumPy; plain int works
    return np.sum(arr*delta,axis=1)/np.sum(delta,axis=1)

def vectorized2(arr,arr2):
    return np.fromiter((np.mean(arr[np.abs(num-arr)<0.2]) for num in arr2),np.float64)

def vectorized3(arr,arr2):
    return np.apply_along_axis(lambda num:np.mean(arr[np.abs(num-arr)<0.2]),1,arr2[:,None])

def vectorized4(arr,arr2):
    select  = np.array([np.nan,1])[(np.abs(arr2[:,None]-arr)<0.2).astype(int)]
    return np.nanmean(select*arr,axis=1)

def vectorized5(arr,arr2):
    arr_tile = np.tile(arr, (len(arr2), 1))
    arr_tile[np.abs(arr - arr2[:, None]) >= 0.2] = np.nan
    return np.nanmean(arr_tile, axis=1)

Note that vectorized2 and vectorized3 are not actually vectorizations of the calculation; they merely hide the loop that is being performed. vectorized5 is Ananda's solution.

Results when arr is larger than arr2:

from timeit import timeit
count = 1

arr  = np.random.random(10000)
arr2 = np.linspace(0, 1, 2000)

t = timeit(lambda:original(arr,arr2),number=count)
print("original    time:",t)

t = timeit(lambda:vectorized1(arr,arr2),number=count)
print("vectorized1 time:",t)

t = timeit(lambda:vectorized2(arr,arr2),number=count)
print("vectorized2 time:",t)

t = timeit(lambda:vectorized3(arr,arr2),number=count)
print("vectorized3 time:",t)

t = timeit(lambda:vectorized4(arr,arr2),number=count)
print("vectorized4 time:",t)

t = timeit(lambda:vectorized5(arr,arr2),number=count)
print("vectorized5 time:",t)

original    time: 0.14478049999999998
vectorized1 time: 0.3868172580000001
vectorized2 time: 0.14587923599999986
vectorized3 time: 0.15062318699999988
vectorized4 time: 0.6438709420000002
vectorized5 time: 0.543624409

Results when arr2 is larger than arr:

arr  = np.random.random(100)
arr2 = np.linspace(0, 1, 200000)

print()
print(f"arr {arr.size}, arr2 {arr2.size}")
t = timeit(lambda:original(arr,arr2),number=count)
print("original    time:",t)

t = timeit(lambda:vectorized1(arr,arr2),number=count)
print("vectorized1 time:",t)

t = timeit(lambda:vectorized2(arr,arr2),number=count)
print("vectorized2 time:",t)

t = timeit(lambda:vectorized3(arr,arr2),number=count)
print("vectorized3 time:",t)

t = timeit(lambda:vectorized4(arr,arr2),number=count)
print("vectorized4 time:",t)

t = timeit(lambda:vectorized5(arr,arr2),number=count)
print("vectorized5 time:",t)

original    time: 1.7699030359999997
vectorized1 time: 0.38871579499999953
vectorized2 time: 1.782099327
vectorized3 time: 2.443001885
vectorized4 time: 0.5951444290000012
vectorized5 time: 0.4536258110000002

Note that this apparent improvement only holds up to a point: once the NxM intermediate exceeds available memory, the vectorized times explode again, even when arr2 is larger than arr.
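One way to keep the footprint bounded, sketched here rather than taken from the answer, is to process arr2 in chunks so the temporary never exceeds a fixed size:

import numpy as np

def vectorized_chunked(arr, arr2, d=0.2, chunk=10_000):
    # Same nan-masking idea, but the temporary is at most chunk x len(arr).
    out = np.empty(arr2.size)
    for start in range(0, arr2.size, chunk):
        block = arr2[start:start + chunk, None]  # shape (chunk, 1)
        masked = np.where(np.abs(arr - block) < d, arr, np.nan)
        out[start:start + chunk] = np.nanmean(masked, axis=1)
    return out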

Upvotes: 3

Ananda

Reputation: 3270

One way of solving this is to set all the numbers that don't fit your condition to nan and take the mean of the rest.

import numpy as np 

arr = np.random.random(100)
arr2 = np.linspace(0,1,20)
arr3 = np.zeros(20) # this is the array i want to store the result in...

for index,num in enumerate(list(arr2)): 
    arr3[index] = np.mean(arr[np.abs(num - arr) < 0.2])
    
arr_tile = np.tile(arr, (len(arr2), 1))
arr_tile[np.abs(arr - arr2[:, None]) >= 0.2] = np.nan
res = np.nanmean(arr_tile, axis=1)

np.allclose(res, arr3)

This gives True

Upvotes: 1
