Wana_B3_Nerd

Reputation: 643

How to count how many data points fall in a bin

I have set the parameters of my bins, and I want to add one to a bin whenever a data point falls within that bin's range. Essentially, I want to count how many data points fall in each bin range so that I can use that count as the "frequency" when I graph it.

My bin ranges are set by:

 bins = [(i*bin_width, (i+1)*bin_width) for i in range(num_bins)]

and my data looks something like:

2.55619101399
2.55619101399
2.55619101399
3.615
4.42745271008
2.55619101399
2.55619101399
2.55619101399
4.42745271008
3.615
2.55619101399
4.42745271008
5.71581687075
5.71581687075
3.615
2.55619101399
2.55619101399
2.55619101399
2.55619101399
2.55619101399

Upvotes: 3

Views: 20986

Answers (3)

abarnert

Reputation: 365657

Since you're using NumPy, you (a) shouldn't be trying to create lists and loop over them instead of using arrays, and (b) should look to see if what you want to do is already built-in (or available in SciPy or Pandas or some other library built on NumPy), because often it is.

And numpy.histogram is exactly what you want.

It takes a total range (a lower and an upper bound) rather than a bin width, but other than that, it's trivial to plug in the values you already have and get back the values you want:

import numpy as np

hist, edges = np.histogram(
    data_points,
    bins=num_bins,
    range=(0, bin_width*num_bins),
    density=False)

The hist array will contain the counts for each bin (like bin_counts in my other answer), which is what you want to post-process and eventually graph.

You may or may not need the edges. They hold the same information as the bins in your original question, but in a different format: instead of [(0, .1), (.1, .2), (.2, .3)] it's [0, .1, .2, .3].
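As a rough end-to-end check, here is a minimal sketch that runs np.histogram on the sample values from the question. The bin_width = 1.0 and num_bins = 6 values are assumptions picked for illustration, since the question doesn't state them.

```python
import numpy as np

# Sample values copied from the question
data_points = np.array([
    2.55619101399, 2.55619101399, 2.55619101399, 3.615, 4.42745271008,
    2.55619101399, 2.55619101399, 2.55619101399, 4.42745271008, 3.615,
    2.55619101399, 4.42745271008, 5.71581687075, 5.71581687075, 3.615,
    2.55619101399, 2.55619101399, 2.55619101399, 2.55619101399, 2.55619101399,
])

# Assumed for illustration; not given in the question
bin_width = 1.0
num_bins = 6

hist, edges = np.histogram(
    data_points,
    bins=num_bins,
    range=(0, bin_width * num_bins))

# Rebuild the (start, stop) pairs from the flat edges array
bins = list(zip(edges[:-1], edges[1:]))

# Every bin is half-open [start, stop) except the last, which also
# includes its right edge.
for (start, stop), count in zip(bins, hist):
    print(f"{start:.1f} to {stop:.1f}: {count}")
```

With these assumed settings, the 20 sample values land in the bins starting at 2, 3, 4 and 5, and the two lowest bins stay empty.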

Upvotes: 6

Jonk

Reputation: 23

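from collections import Counter

# Binary search over the sorted, contiguous (start, stop) bins.
# Assumes `bins` and `data` are defined as in the question.
frequency_data = Counter()

for d in data:
    low, high = 0, len(bins) - 1
    while low <= high:
        median = (low + high) // 2  # integer division; `/` gives a float in Python 3
        start, stop = bins[median]
        if d < start:
            high = median - 1
        elif d >= stop:
            low = median + 1
        else:  # start <= d < stop, so d belongs to this bin
            frequency_data[bins[median]] += 1
            break
    # values outside every bin fall out of the loop uncounted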
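The same binary-search idea can probably be expressed more compactly with the standard library's bisect module. This is only a sketch under assumptions: the bins are contiguous and sorted, each bin is treated as half-open [start, stop), and bin_width, num_bins and the short data list are made-up illustration values.

```python
from bisect import bisect_right
from collections import Counter

# Assumed for illustration; not given in the question
bin_width = 1.0
num_bins = 6
bins = [(i * bin_width, (i + 1) * bin_width) for i in range(num_bins)]

# Flat, sorted list of edges: every bin start plus the final stop
edges = [start for start, _ in bins] + [bins[-1][1]]

# A few of the question's values, plus one out-of-range value (9.0)
data = [2.55619101399, 3.615, 4.42745271008, 5.71581687075, 2.55619101399, 9.0]

frequency_data = Counter()
for d in data:
    # bisect_right returns the number of edges <= d, so subtracting 1
    # gives the index of the bin whose start is the last edge <= d
    index = bisect_right(edges, d) - 1
    if 0 <= index < len(bins):  # ignore values outside every bin
        frequency_data[bins[index]] += 1

for bin_range, count in sorted(frequency_data.items()):
    print(bin_range, count)
```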

Upvotes: 0

abarnert

Reputation: 365657

Well, first, each of your bins is just a tuple of the start and end values of that bin, so there's no way to add anything to it. You could change each bin into, say, a list of [start, stop, 0] instead of a tuple of (start, stop), or, maybe even better, an object. Alternatively, you could keep a separate bin_counts list, parallel to the bins list, and, e.g., zip them up when needed.

Next, if each bin goes from i * bin_width to (i+1) * bin_width, then how do you get the i value from a data value? That's easy: the opposite of multiply is divide, so it's just data_point // bin_width.

So:

bin_counts = [0] * len(bins)
for data_point in data_points:
    # // on a float returns a float, so convert it before using it as an index
    bin_number = int(data_point // bin_width)
    bin_counts[bin_number] += 1

Showing one of the other options, because I think you were asking about it in the comments:

bins = [[i*bin_width, (i+1)*bin_width, 0] for i in range(num_bins)]
for data_point in data_points:
    # // on a float returns a float, so convert it before using it as an index
    bin_number = int(data_point // bin_width)
    bins[bin_number][2] += 1

Here, each bin is a list of [start, stop, count], instead of having a list of (start, stop) bins and a separate list of count values.
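One thing the snippets above don't handle is a value outside the binned range. An index of num_bins or more raises an IndexError, and a negative index silently counts into a bin from the end of the list. Here is a minimal sketch of the parallel-list variant with a range check, zipped up with the bins at the end. The helper name count_in_bins, the sample values, and the bin_width and num_bins settings are all hypothetical.

```python
def count_in_bins(data_points, bin_width, num_bins):
    """Count how many data points fall in each [i*bin_width, (i+1)*bin_width) bin."""
    bin_counts = [0] * num_bins
    for data_point in data_points:
        # Floor division maps a value to its bin number; int() turns the
        # float result into a usable list index
        bin_number = int(data_point // bin_width)
        # Skip values below 0 or at/above bin_width * num_bins
        if 0 <= bin_number < num_bins:
            bin_counts[bin_number] += 1
    return bin_counts


# Assumed for illustration; not given in the question
bin_width = 1.0
num_bins = 6
bins = [(i * bin_width, (i + 1) * bin_width) for i in range(num_bins)]

# Made-up values, including two that fall outside every bin (7.2 and -0.5)
data_points = [2.5, 3.615, 3.9, 5.7, 7.2, -0.5]

bin_counts = count_in_bins(data_points, bin_width, num_bins)

# Zip the parallel lists together when both are needed, e.g. for graphing
for (start, stop), count in zip(bins, bin_counts):
    print(start, stop, count)
```

Whether out-of-range values should be skipped, clamped into the first or last bin, or treated as an error depends on the data, so the skip here is just one possible choice.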

Upvotes: 3
