Reputation: 10197
I have a function that returns a subset of a column, and I'd like to apply it efficiently to each column. The result is then no longer a matrix but a list of columns of different lengths. I failed to use numpy's apply_along_axis to do so because of this size mismatch. Is there a way to do this efficiently, other than iterating over the columns myself?
import numpy as np

col_pred = lambda x: [v for v in x if v > 0.5]
filteredData = np.apply_along_axis(col_pred, 0, data)
# ValueError: could not broadcast input array from shape (3) into shape (4)
for instance with an input
data = [[0, 1, 1, 0], [1, 1, 1, 1]]
# my real data is more like a matrix with a lot of rows in [0-1]
# that can be simulated with
# data = [[random.uniform(0, 1) for i in range(10)] for j in range(100000)]
I'd like to get
[[1, 1], [1, 1, 1, 1]]
Upvotes: 2
Views: 1486
Reputation: 231395
So as a Python list problem this is:
In [606]: col_pred = lambda x: [v for v in x if v > 0.5]
In [607]: data = [[0, 1, 1, 0], [1, 1, 1, 1]]
In [608]: [col_pred(i) for i in data]
Out[608]: [[1, 1], [1, 1, 1, 1]]
In your big data example, it takes much longer to generate the data than to run this list comprehension:
In [611]: data1 = [[np.random.uniform(0, 1) for i in range(10)] for j in range(100000)]
In [612]: timeit data1 = [[np.random.uniform(0, 1) for i in range(10)] for j in range(100000)]
1 loop, best of 3: 2.62 s per loop
In [615]: data2=[col_pred(i) for i in data1]
In [618]: timeit data2=[col_pred(i) for i in data1]
10 loops, best of 3: 191 ms per loop
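As an aside, most of that 2.6 s goes into per-element Python calls; if equivalent test data is all you need, a vectorized generator is far cheaper. A minimal sketch, assuming np.random.rand is an acceptable stand-in for the per-element uniform calls:

import numpy as np

# One vectorized call builds the whole 100000 x 10 matrix of uniforms;
# .tolist() converts it to the same nested-list layout as data1.
data1 = np.random.rand(100000, 10).tolist()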
Comparing this with @Divakar's efficient numpy solution:
In [622]: threshold_along_an_axis(np.array(data).T)
Out[622]: [[1, 1], [1, 1, 1, 1]]
In [624]: x3=threshold_along_an_axis(np.array(data1).T)
In [625]: timeit x3=threshold_along_an_axis(np.array(data1).T)
10 loops, best of 3: 214 ms per loop
Oops - it's slower, unless we take the array creation step out of the timing:
In [626]: arr=np.array(data1).T
In [627]: timeit x3=threshold_along_an_axis(arr)
10 loops, best of 3: 128 ms per loop
This is an old familiar story. List comprehensions often perform well for small lists and when array creation adds significant overhead.
I don't have 1.13 and the new stuff that Eric mentions, but changing the list to an object array does let me use np.frompyfunc:
In [640]: dataO = np.empty(len(data1), object)
In [641]: dataO[:]=data1
In [642]: x5=np.frompyfunc(col_pred, 1,1)(dataO)
In [643]: timeit x5=np.frompyfunc(col_pred, 1,1)(dataO)
10 loops, best of 3: 197 ms per loop
np.frompyfunc takes an array (or arrays), applies broadcasting, and evaluates a 'scalar' function, returning an object array. It is used by np.vectorize, and often gives a 2x speed improvement over direct iteration. Here, though, it does not help.
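For illustration, a minimal sketch of np.frompyfunc in isolation; the doubling function is just a stand-in to show the elementwise application and the object-array return:

import numpy as np

double = np.frompyfunc(lambda x: x * 2, 1, 1)  # 1 input, 1 output
double(np.arange(3))
# array([0, 2, 4], dtype=object)  -- applied elementwise, result is object dtype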
Upvotes: 1
Reputation: 97591
If you want a ragged array in numpy, you have to use object arrays.
You first need a small helper function to convert any value into a 0d object array:
def object_scalar(x):
    obj = np.empty((), dtype=object)
    obj[()] = x
    return obj
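A quick sanity check of the helper, just as a usage sketch:
>>> obj = object_scalar([1, 2, 3])
>>> obj.shape
()
>>> obj.item()
[1, 2, 3]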
Then, in the upcoming 1.13, you can do:
>>> f = lambda x: object_scalar(col_pred(x))
>>> np.apply_along_axis(f, 0, data)
array([list([1]), list([1, 1]), list([1, 1]), list([1])], dtype=object)
Unfortunately, the latest released version of numpy has a bug that makes apply_along_axis not handle 0d arrays correctly. You can work around that by promoting to a 1d array, and then demoting back to 0d afterwards:
>>> np.apply_along_axis(lambda x: f(x)[np.newaxis], 0, data).squeeze(axis=0)
array([[1], [1, 1], [1, 1], [1]], dtype=object)
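If you want the question's plain list-of-lists form rather than an object array, .tolist() should unwrap it:
>>> np.apply_along_axis(lambda x: f(x)[np.newaxis], 0, data).squeeze(axis=0).tolist()
[[1], [1, 1], [1, 1], [1]]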
Upvotes: 1
Reputation: 221574
Looking at your code, it seems you are trying to output all the elements per column that are greater than a threshold of 0.5. Here's an approach that does that and is also generalized to work both along the rows and along the columns -
def threshold_along_an_axis(a, thresh=0.5, axis=0):
    if axis == 0:
        A = a.T  # transpose so that columns become rows
    else:
        A = a
    mask = A > thresh
    s = mask.sum(1)            # number of kept elements per row
    s0 = np.r_[0, s.cumsum()]  # slice boundaries into the flattened keepers
    arr = A[mask].tolist()     # skip .tolist() if a list of arrays is needed as o/p
    return [arr[s0[i]:s0[i+1]] for i in range(len(s0)-1)]
The intention here is to do minimal work inside the list comprehension.
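To make the offset bookkeeping concrete, here is a small hand-traceable sketch of the same cumsum-and-slice trick (the toy matrix is mine, not from the question):

import numpy as np

A = np.array([[0.9, 0.1],
              [0.2, 0.3],
              [0.6, 0.7]])
mask = A > 0.5
s = mask.sum(1)            # kept elements per row: [1, 0, 2]
s0 = np.r_[0, s.cumsum()]  # slice boundaries:      [0, 1, 1, 3]
arr = A[mask].tolist()     # flattened keepers:     [0.9, 0.6, 0.7]
[arr[s0[i]:s0[i+1]] for i in range(len(s0)-1)]
# [[0.9], [], [0.6, 0.7]]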
Sample run -
In [1]: a = np.random.rand(4,5)
In [2]: a
Out[2]:
array([[ 0.45973245, 0.3671334 , 0.12000436, 0.04205402, 0.74729737],
[ 0.55217308, 0.4018889 , 0.55695863, 0.55824384, 0.33435153],
[ 0.32450124, 0.07713855, 0.09126221, 0.13150986, 0.27961361],
[ 0.0876053 , 0.42685005, 0.53034652, 0.15084453, 0.51518185]])
In [3]: threshold_along_an_axis(a, thresh=0.5, axis=0) # per column
Out[3]:
[[0.5521730819881912],
[],
[0.5569586261866918, 0.5303465159370833],
[0.5582438446718111],
[0.7472973699509776, 0.5151818458812673]]
In [4]: threshold_along_an_axis(a, thresh=0.5, axis=1) # per row
Out[4]:
[[0.7472973699509776],
[0.5521730819881912, 0.5569586261866918, 0.5582438446718111],
[],
[0.5303465159370833, 0.5151818458812673]]
Upvotes: 2