Gene Burinsky

Reputation: 10223

Identification of rows containing column median in numpy matrix of cum percentiles

Consider the matrix quantiles, which is the [:8,:3,0] subset of a 3D matrix with shape (10,355,8).

quantiles = np.array([
       [ 1.        ,  1.        ,  1.        ],
       [ 0.63763978,  0.61848863,  0.75348137],
       [ 0.43439645,  0.42485407,  0.5341457 ],
       [ 0.22682343,  0.18878366,  0.25253915],
       [ 0.16229408,  0.12541476,  0.15263742],
       [ 0.12306046,  0.10372971,  0.09832783],
       [ 0.09271845,  0.08209844,  0.05982584],
       [ 0.06363636,  0.05471266,  0.03855727]])

I want a boolean output of the same shape as the quantiles matrix where True marks the row in which the median is located:

In [21]: medians
Out[21]:
array([[False, False, False],
       [ True,  True, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

To achieve this, I have the following algorithm in mind:

1) Identify the entries that are greater than .5:

In [22]: quantiles>.5
Out[22]:
array([[ True,  True,  True],
       [ True,  True,  True],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

2) Considering only the values selected by the quantiles>.5 operation, mark the row that minimizes the np.abs distance between the entry and .5. Torturing the terminology a bit, I wish to intersect np.argmin(np.abs(quantiles-.5),axis=0) with quantiles>.5 to get the result above. However, I cannot for the life of me figure out a way to perform the np.argmin on that subset while retaining the shape of the quantiles matrix.
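
In code, the two pieces I am trying to combine look roughly like this (np.argmin on its own runs over all rows, which is exactly the problem):

import numpy as np

# the two pieces I want to "intersect"
closest = np.argmin(np.abs(quantiles - .5), axis=0)  # per-column argmin, but over ALL rows
valid = quantiles > .5                               # rows eligible to contain the median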

P.S. Yes, there is a similar question here, but it doesn't implement my algorithm, which I think could be more efficient at larger scale.

Upvotes: 1

Views: 169

Answers (2)

Divakar

Reputation: 221664

Approach #1

Here's an approach using broadcasting and a masking trick -

# Mask of quantiles less than or equal to 0.5 to select the invalid ones
mask1 = quantiles<=0.5

# Since we are dealing with quantiles, no element exceeds 1. We can leverage
# that by adding 1 to the invalid elems, which pushes them out of contention
# when we look for the argmin down each column
min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)

# Let some broadcasting magic happen here!
out = min_idx == np.arange(quantiles.shape[0])[:,None]

Step-by-step run

1) Input:

In [37]: quantiles
Out[37]: 
array([[ 1.        ,  1.        ,  1.        ],
       [ 0.63763978,  0.61848863,  0.75348137],
       [ 0.43439645,  0.42485407,  0.5341457 ],
       [ 0.22682343,  0.18878366,  0.25253915],
       [ 0.16229408,  0.12541476,  0.15263742],
       [ 0.12306046,  0.10372971,  0.09832783],
       [ 0.09271845,  0.08209844,  0.05982584],
       [ 0.06363636,  0.05471266,  0.03855727]])

2) Run the code:

In [38]: mask1 = quantiles<=0.5
    ...: min_idx = (np.abs(quantiles-0.5)+mask1).argmin(0)
    ...: out = min_idx == np.arange(quantiles.shape[0])[:,None]
    ...: 

3) Analyze the output at each step:

In [39]: mask1
Out[39]: 
array([[False, False, False],
       [False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

In [40]: np.abs(quantiles-0.5)+mask1
Out[40]: 
array([[ 0.5       ,  0.5       ,  0.5       ],
       [ 0.13763978,  0.11848863,  0.25348137],
       [ 1.06560355,  1.07514593,  0.0341457 ],
       [ 1.27317657,  1.31121634,  1.24746085],
       [ 1.33770592,  1.37458524,  1.34736258],
       [ 1.37693954,  1.39627029,  1.40167217],
       [ 1.40728155,  1.41790156,  1.44017416],
       [ 1.43636364,  1.44528734,  1.46144273]])

In [41]: (np.abs(quantiles-0.5)+mask1).argmin(0)
Out[41]: array([1, 1, 2])

In [42]: min_idx == np.arange(quantiles.shape[0])[:,None]
Out[42]: 
array([[False, False, False],
       [ True,  True, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

Performance boost: Following the comments, it seems that to get min_idx we can simply do:

min_idx = (quantiles+mask1).argmin(0)
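
This works because, for the valid entries (those above 0.5), np.abs(quantiles-0.5) is just quantiles-0.5, so the shifted distance and the shifted quantile attain their minimum at the same row. A quick sanity check on the sample data (a sketch; it assumes the quantiles lie in [0, 1] and every column has at least one entry above 0.5):

mask1 = quantiles <= 0.5
idx_a = (np.abs(quantiles-0.5)+mask1).argmin(0)   # original expression
idx_b = (quantiles+mask1).argmin(0)               # simplified expression
assert np.array_equal(idx_a, idx_b)               # both give array([1, 1, 2])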

Approach #2

This is focused on memory efficiency.

# Mask of quantiles greater than 0.5 to select the valid ones
mask = quantiles>0.5

# Select valid elems
vals = quantiles.T[mask.T]

# Get valid count per col
count = mask.sum(0)

# Get the min val per col given the mask
minval = np.minimum.reduceat(vals,np.append(0,count[:-1].cumsum()))

# Get final boolean array by just comparing the min vals across each col
out = np.isclose(quantiles,minval)
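
For context on the reduceat call: np.append(0,count[:-1].cumsum()) gives the offset at which each column's run of valid elements starts inside the flattened vals array, and np.minimum.reduceat takes the minimum of each run. A rough walk-through on the sample quantiles (values rounded, shown only to illustrate the bookkeeping):

# count   -> array([2, 2, 3])    number of valid elems per column
# offsets -> array([0, 2, 4])    start of each column's run inside vals
# vals    -> [1.0, 0.638, 1.0, 0.618, 1.0, 0.753, 0.534]
# minval  -> array([0.638, 0.618, 0.534])   per-column min of the valid elems
offsets = np.append(0, count[:-1].cumsum())
minval = np.minimum.reduceat(vals, offsets)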

Upvotes: 1

Gene Burinsky

Reputation: 10223

Bumping into the masked-array operations in NumPy, I found the following solution:

import numpy.ma as ma

# mask quantiles that are less than .5
masked_quantiles = ma.masked_where(quantiles<.5, quantiles)

# identify the per-column minimum of the masked array
median_idx = np.where(masked_quantiles == masked_quantiles.min(axis=0))

# make a matrix of all False values
median_mat = np.zeros(quantiles.shape, dtype=bool)

# assign True to the corresponding rows
median_mat[median_idx] = True

In [87]: median_mat
Out[87]:
array([[False, False, False],
       [ True,  True, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)
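
A slightly shorter variant of the same idea (it is what I time below) takes the masked argmin directly and broadcasts it against the row indices, mirroring the last step of Divakar's Approach #1:

import numpy as np
import numpy.ma as ma

masked_quantiles = ma.masked_where(quantiles<=.5, quantiles)   # hide the invalid entries
median_idx = masked_quantiles.argmin(axis=0)                   # per-column argmin over the unmasked entries
median_mat = median_idx == np.arange(quantiles.shape[0])[:,None]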

Update: comparison of my answer with Divakar's:

I ran two comparisons: one on the sample 2D matrix provided in this question and one on my 3D (10,380,8) dataset (not large data by any means).

Sample dataset:

My code

%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)

10000 loops, best of 3: 65.1 µs per loop

Divakar's code

%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)

The slowest run took 17.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.92 µs per loop

Full dataset

My code:

%%timeit
masked_quantiles = ma.masked_where(quantiles<=.5,quantiles)
median_idx = masked_quantiles.argmin(0)

1000 loops, best of 3: 490 µs per loop

Divakar's code:

%%timeit
mask1 = quantiles<=0.5
min_idx = (quantiles+mask1).argmin(0)

10000 loops, best of 3: 172 µs per loop

Conclusion:

Divakar's answer seems about 3-12 times faster than mine. I presume that the ma.masked_where masking operation takes longer than the matrix addition. However, the addition has to materialize a new array, whereas masking may be more memory-efficient on larger datasets; I wonder how the two approaches would compare on data that doesn't fit, or only barely fits, into memory.

Upvotes: 1
