dougie fresh

Reputation: 65

How to efficiently produce a histogram with a large number of bins and data

I'm asked to look at how the central limit theorem applies to uniformly distributed random numbers. For the first part of the problem I'm asked to create 1,000,000 bins with one number in each bin, and then with 2, 3, and 10 numbers in each bin.

I've used the NumPy package for creating histograms, but trying to create 1,000,000 bins with one number in each bin takes an ungodly amount of time. I was able to create histograms with 1,000 and 10,000 bins and random numbers, though, so I think numpy.histogram just isn't an efficient method for handling a large number of bins.

Are there other methods for creating histograms with large amounts of data and bins?

EDIT: the random numbers are in the interval [0,1].

Upvotes: 2

Views: 819

Answers (1)

hyperTrashPanda

Reputation: 868

You've left details out of your question that could be crucial.

What's your bin size (i.e. do you have 1M bins spanning [0,1], [0,20], or [0,1M])? What are your performance requirements, and what counts as "slow" for your purposes? Are you hitting memory limits, CPU usage limits, or something else?

One trivial solution is to use random.random() to generate a random number in [0,1), and then multiply and add to shift it into whichever interval you need.
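As a minimal sketch of that rescaling, assuming a target interval [a, b): the helper name uniform_in is made up for illustration and is not part of the standard library. The standard library's random.uniform(a, b) does a similar job.

```python
import random

def uniform_in(a, b, rng=random):
    """Rescale random.random() from [0, 1) to roughly [a, b).

    Hypothetical helper for illustration; floating-point rounding
    can occasionally return b itself.
    """
    return a + (b - a) * rng.random()

# Five samples spread over [0, 20) instead of [0, 1).
samples = [uniform_in(0, 20) for _ in range(5)]
print(samples)
```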

The following code samples 1M bins, of size 1 each, with each bin containing 2 numbers.

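import random

hist_data = []
in_each_bin = 2  # how many random numbers fall into each bin

# Bin i covers [i, i+1), so i + random.random() always lands in bin i.
for i in range(1000000):
    for j in range(in_each_bin):
        hist_data.append(i + random.random())

print(len(hist_data))
print(hist_data[0:20])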

It runs in about 3.4 seconds of wall-clock time (roughly 2.5 seconds of CPU time) on my mid-range machine.

$ time python3 pytest.py
2000000
[0.9271533001749838, 0.6759096885597532, 1.0950935186564377, 1.4195955772696995, 2.620307487968376, 2.535700184898931, 3.606823695579621, 3.5471311130365346, 4.01255833303964, 4.013715023517034, 5.42988725471679, 5.257435390135351, 6.681956593279519, 6.686189487682324, 7.916591795688389, 7.598478524938438, 8.309152266029844, 8.997231092516385, 9.801082205541228, 9.198095437802664]

real    0m3.418s
user    0m2.547s
sys     0m0.500s
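The loop above only generates the data; it still has to be counted into bins. Here is a rough sketch of one way to do that in pure Python, assuming equal-width bins over a known interval, so that each value's bin index can be computed directly instead of searched for. It also assumes one reading of the assignment, namely that each histogram entry is the mean of k uniform numbers. The function names and the sizes (100,000 samples, 50 bins) are illustrative choices, not taken from the question.

```python
import random

def fixed_width_histogram(values, n_bins, lo=0.0, hi=1.0):
    """Count values into n_bins equal-width bins covering [lo, hi]."""
    counts = [0] * n_bins
    scale = n_bins / (hi - lo)
    for x in values:
        if x < lo or x > hi:
            continue  # ignore values outside the histogram range
        # Direct index computation; x == hi is clamped into the last bin.
        idx = min(int((x - lo) * scale), n_bins - 1)
        counts[idx] += 1
    return counts

def sample_means(n_samples, k, rng):
    """Return n_samples values, each the mean of k uniform numbers in [0, 1)."""
    return [sum(rng.random() for _ in range(k)) / k for _ in range(n_samples)]

rng = random.Random(0)          # fixed seed so runs are repeatable
means = sample_means(100_000, 10, rng)
counts = fixed_width_histogram(means, 50)
print(counts)
```

With k = 1 the counts should come out roughly flat, and as k grows they should pile up around 0.5, which is the behaviour the central limit theorem predicts.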

Does that fit your needs and requirements?

Upvotes: 1
