Reputation: 10223
Consider the matrix quantiles
that's a subset [:8,:3,0]
of a 3D matrix with shape (10,355,8)
.
quantiles = np.array([
[ 1. , 1. , 1. ],
[ 0.63763978, 0.61848863, 0.75348137],
[ 0.43439645, 0.42485407, 0.5341457 ],
[ 0.22682343, 0.18878366, 0.25253915],
[ 0.16229408, 0.12541476, 0.15263742],
[ 0.12306046, 0.10372971, 0.09832783],
[ 0.09271845, 0.08209844, 0.05982584],
[ 0.06363636, 0.05471266, 0.03855727]])
I want a boolean output of the same shape as the quantiles
matrix where True
marks the row in which the median is located:
In [21]: medians
Out[21]:
array([[False, False, False],
[ True, True, False],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
To achieve this, I have the following algorithm in mind:
1) Identify the entries that are greater than .5
:
In [22]: quantiles>.5
Out[22]:
array([[ True, True, True],
[ True, True, True],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
2) Considering only the values subset by the quantiles>.5
operation, mark the row that minimizes the np.abs
distance between the entry and .5
. Torturing the terminology a bit, I wish to intersect the two matrices of np.argmin(np.abs(quantiles-.5),axis=0)
and quantiles>.5
to get the above result. However, I cannot for my life figure out a way to perform the np.argmin
on the subset and retain the shape of the quantile
matrix.
PS. Yes, there is a similar question here but it doesn't implement my algorithm which could be, I think, more efficient on a larger scale
Upvotes: 1
Views: 169
Reputation: 221664
Here's an approach using broadcasting
and some masking trick -
# Mask of quantiles lesser than or equal to 0.5 to select the invalid ones
mask1 = quantiles<=0.5
# Since we are dealing with quantiles, the elems won't be > 1,
# which can be leveraged here as we will add 1s to invalid elems, and
# then look for argmin across each col
min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)
# Let some broadcasting magic happen here!
out = min_idx == np.arange(quantiles.shape[0])[:,None]
Step-by-step run
1) Input :
In [37]: quantiles
Out[37]:
array([[ 1. , 1. , 1. ],
[ 0.63763978, 0.61848863, 0.75348137],
[ 0.43439645, 0.42485407, 0.5341457 ],
[ 0.22682343, 0.18878366, 0.25253915],
[ 0.16229408, 0.12541476, 0.15263742],
[ 0.12306046, 0.10372971, 0.09832783],
[ 0.09271845, 0.08209844, 0.05982584],
[ 0.06363636, 0.05471266, 0.03855727]])
2) Run the code :
In [38]: mask1 = quantiles<=0.5
...: min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)
...: out = min_idx == np.arange(quantiles.shape[0])[:,None]
...:
3) Analyze output at each step :
In [39]: mask1
Out[39]:
array([[False, False, False],
[False, False, False],
[ True, True, False],
[ True, True, True],
[ True, True, True],
[ True, True, True],
[ True, True, True],
[ True, True, True]], dtype=bool)
In [40]: np.abs(quantiles-0.5)+mask1
Out[40]:
array([[ 0.5 , 0.5 , 0.5 ],
[ 0.13763978, 0.11848863, 0.25348137],
[ 1.06560355, 1.07514593, 0.0341457 ],
[ 1.27317657, 1.31121634, 1.24746085],
[ 1.33770592, 1.37458524, 1.34736258],
[ 1.37693954, 1.39627029, 1.40167217],
[ 1.40728155, 1.41790156, 1.44017416],
[ 1.43636364, 1.44528734, 1.46144273]])
In [41]: (np.abs(quantiles-0.5)+mask1).argmin(0)
Out[41]: array([1, 1, 2])
In [42]: min_idx == np.arange(quantiles.shape[0])[:,None]
Out[42]:
array([[False, False, False],
[ True, True, False],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
Performance boost : Following the comments, it seems to get min_idx, we can just do :
min_idx = (quantiles+mask1).argmin(0)
This is focused on memory efficiency.
# Mask of quantiles greater than 0.5 to select the valid ones
mask = quantiles>0.5
# Select valid elems
vals = quantiles.T[mask.T]
# Get vald count per col
count = mask.sum(0)
# Get the min val per col given the mask
minval = np.minimum.reduceat(vals,np.append(0,count[:-1].cumsum()))
# Get final boolean array by just comparing the min vals across each col
out = np.isclose(quantiles,minval)
Upvotes: 1
Reputation: 10223
Bumping into the old mask
operation in Numpy
, I found the following solution
#mask quantities that are less than .5
masked_quantiles = ma.masked_where(quantiles<.5,quantiles)
#identify the minimum in column of the masked array
median_idx = np.where(masked_quantiles == masked_quantiles.min(axis=0))
#make a matrix of all False values
median_mat = np.zeros(quantiles.shape, dtype=bool)
#assign True value to corresponding rows
In [86]: median_mat[medians] = True
In [87]: median_mat
Out[87]:
array([[False, False, False],
[ True, True, False],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
I ran two comparisons, one on the sample 2D matrix provided for this question and one on my 3D (10,380,8)
dataset (not large data by any means).
My code
%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)
10000 loops, best of 3: 65.1 µs per loop
Divakar's code
%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)
The slowest run took 17.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.92 µs per loop
My code:
%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)
1000 loops, best of 3: 490 µs per loop
Divakar's code:
%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)
10000 loops, best of 3: 172 µs per loop
Divakar's answer seems about 3-12 times faster than mine. I presume that the np.ma.where
masking operation takes longer than matrix addition. However, the addition operation needs to be stored whereas masking may be more efficient on larger datasets. I wonder how it would compare on something that doesn't or nearly doesn't fit into memory.
Upvotes: 1