Return data indices for all bins with counts greater than threshold

Question

I am trying to find the indices of all the data points that fall within a given bin, where the data are binned like this:

import numpy as np

x=np.random.random(1000)
y=np.random.random(1000)
# The bins are not evenly spaced, and the number of bins may differ between x and y.
xedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

# np.histogram2d returns (counts, xedges, yedges); keep only the counts
h, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])

I want to find the indices of the data points (to plot them, etc.) contained in each bin whose count is greater than some threshold. So each bin with a count greater than the threshold is a "cluster", and I want to know all the data points (x, y) in that cluster.

I wrote in pseudocode how I think it would work.

thres = 5
mask = (h > thres)

for i in mask:
    # for each bin with count > thres
    # get bin edges for x and y directions

    # find (leftEdge <= x < rightEdge) and (leftEdge <= y < rightEdge)

    # return indices for each True in mask

plt.plot(x[indices], y[indices])

I tried reading the documentation for functions such as scipy.stats.binned_statistic_2d and pandas.DataFrame.groupby, but I couldn't figure out how to apply them to my data. For binned_statistic_2d, the documentation asks for an argument called values:

The data on which the statistic will be computed. This must be the same shape as x, or a set of sequences - each the same shape as x.

And I wasn't sure how to input the data I wanted it to be computed on.

Thank you for any help you can provide on this issue.

JohanC · Accepted Answer

If I understand correctly, you want to build a mask over the original points that indicates whether each point belongs to a bin containing more than 5 points.

np.histogram2d returns the counts for each bin, but it does not indicate which point goes into which bin, so the counts alone are not enough to construct such a mask.

You can construct the mask by iterating over every bin that fulfills the condition and adding the indices of all points inside that bin to the mask.

To visualize the result of np.histogram2d, plt.pcolormesh can be used. Drawing the mesh of h > 5 with the 'coolwarm' colormap shows the True bins in the color at the top of the colormap (red) and the False bins in the color at the bottom (blue).

from matplotlib import pyplot as plt
import numpy as np

x = np.random.uniform(0, 2, 500)
y = np.random.uniform(0, 1, x.shape)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

hist, _xedges, _yedges = np.histogram2d(x, y, bins=[xedges, yedges])

h = hist.T  # np.histogram2d puts x along the first axis; pcolormesh expects rows to be y, so transpose
thres = 5
desired = h > thres
plt.pcolormesh(xedges, yedges, desired, cmap='coolwarm', ec='white', lw=2)

mask = np.zeros_like(x, dtype=bool)  # start with mask all False (plain bool works across NumPy versions)
for i in range(len(xedges) - 1):
    for j in range(len(yedges) - 1):
        if desired[j, i]:
            # print(f'x from {xedges[i]} to {xedges[i + 1]} y from {yedges[j]} to {yedges[j + 1]}')
            mask = np.logical_or(mask, (x >= xedges[i]) & (x < xedges[i + 1]) & (y >= yedges[j]) & (y < yedges[j + 1]))
            # plt.scatter(np.random.uniform(xedges[i], xedges[i+1], 100), np.random.uniform(yedges[j], yedges[j+1], 100),
            #             marker='o', color='g', alpha=0.3)
plt.scatter(x, y, marker='o', color='gold', label='initial points')
plt.scatter(x[mask], y[mask], marker='.', color='green', label='filtered points')
plt.legend(bbox_to_anchor=(1.02, 1), loc='upper left')
plt.tight_layout()
plt.show()
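
The double loop can also be replaced by a per-point bin lookup. The following is a minimal sketch, not part of the original answer, assuming np.digitize is used to assign each point its bin index along x and along y. The fixed seed and the names ix, iy, inside and clusters are introduced here for illustration only. It also collects the point indices of each cluster separately, which is what the question asked for.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only to make the sketch reproducible
x = rng.uniform(0, 2, 500)
y = rng.uniform(0, 1, x.shape)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

# hist[i, j] counts the points in x-bin i and y-bin j (not transposed here)
hist, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
thres = 5

# np.digitize returns k with edges[k-1] <= v < edges[k]; subtract 1 to get
# a 0-based bin index. Points outside the edges get -1 or len(edges) - 1.
ix = np.digitize(x, xedges) - 1
iy = np.digitize(y, yedges) - 1

# Keep only points that fall inside the binned region in both directions.
# Like the loop version, this treats the last right edge as exclusive.
inside = (ix >= 0) & (ix < len(xedges) - 1) & (iy >= 0) & (iy < len(yedges) - 1)

# Look up the count of each point's own bin and compare with the threshold
mask = np.zeros(x.shape, dtype=bool)
mask[inside] = hist[ix[inside], iy[inside]] > thres
indices = np.flatnonzero(mask)  # indices into x and y of all selected points

# Indices grouped per cluster, keyed by (x-bin, y-bin)
clusters = {
    (int(i), int(j)): np.flatnonzero((ix == i) & (iy == j))
    for i, j in zip(*np.nonzero(hist > thres))
}
```

Here x[mask] and y[mask] give the same filtered points as the loop, and clusters[(i, j)] holds the indices of the points in one particular bin, so each cluster can be plotted on its own.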

Note that in the given example the edges don't cover the complete range of points. The points outside the given edges will not be taken into account. To include these points, just extend the edges.
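
To check how many points this affects, one can compare the histogram total with the number of points. This is a small sketch, assuming the data range is known to be 0 to 2 in x and 0 to 1 in y, as in the example; the names n_outside, xedges_full and yedges_full are illustrative and the seed is fixed only for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed fixed only to make the sketch reproducible
x = rng.uniform(0, 2, 500)
y = rng.uniform(0, 1, x.shape)

xedges = np.array([0.1, 0.2, 0.5, 0.55, 0.6, 0.8, 1.0, 1.3, 1.5, 1.9])
yedges = np.array([0.1, 0.2, 0.4, 0.5, 0.55, 0.6, 0.8, 0.9])

# Points outside the edges are silently dropped by np.histogram2d,
# so the histogram total is smaller than the number of points.
hist, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
n_outside = x.size - int(hist.sum())
print(f"{n_outside} of {x.size} points fall outside the given edges")

# Extend the edges to the assumed full data range so every point lands in a bin
xedges_full = np.concatenate(([0.0], xedges, [2.0]))
yedges_full = np.concatenate(([0.0], yedges, [1.0]))
hist_full, _, _ = np.histogram2d(x, y, bins=[xedges_full, yedges_full])
print(f"{int(hist_full.sum())} points are binned with the extended edges")
```

With the extended edges, the outer bins can themselves exceed the threshold and become clusters, so whether to extend depends on whether the border regions should count.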
