Reputation: 5473
I have created a distribution using numpy histogram and digitize functions.
import numpy as np
_, bins = np.histogram(x, bins=bins)
arr = np.digitize(x, bins) - 1  # index of the edge at or below each value
x = bins[arr]  # snap each value down to that edge
Or possibly:
x = pandas.cut(x, bins=bins)
However, as the distribution is very skewed, even after removing outliers, there are many bins with very few observations. I want to merge bins, somewhat similar to:
The procedure would possibly involve pandas groupby and then merging groups with fewer than n observations
into their neighbouring bins. Is there a way to achieve this in pandas/numpy?
Upvotes: 4
Views: 1160
Reputation: 2093
As promised, I implemented something in physt, version 0.3.5. You're welcome to use it.
See http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#Merging-bins and particularly http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#By-min-frequency
In your case, the workflow would be something like this:
import physt
histogram = physt.h1(x, bins=bins)
histogram.merge_bins(min_frequency=n)
bins = histogram.numpy_bins
Note that the code is at the alpha stage, and not every resulting bin is guaranteed to reach the required minimum (this is deliberate, in order to preserve tall narrow bins). I am still looking for the best algorithm.
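If you would rather not add a dependency, the general idea can be sketched in plain Python. This is not the algorithm physt uses. It is a simple greedy left-to-right merge, and the function name merge_sparse_bins and its parameters are made up for illustration. Unlike the physt version, every merged bin here ends up with at least min_count observations, provided the total count is at least min_count.

```python
def merge_sparse_bins(counts, edges, min_count):
    """Greedily merge adjacent bins, left to right.

    counts: per-bin frequencies (length k)
    edges:  bin edges (length k + 1)
    Returns (merged_counts, merged_edges) as plain lists.
    Hypothetical helper for illustration, not part of numpy/pandas/physt.
    """
    if len(edges) != len(counts) + 1:
        raise ValueError("edges must have exactly one more entry than counts")

    merged_counts = []
    merged_edges = [edges[0]]
    running = 0        # observations accumulated in the currently open bin
    closed = True      # True when no bin is waiting to be closed

    for count, right_edge in zip(counts, edges[1:]):
        running += count
        closed = False
        if running >= min_count:
            # Enough observations: close the merged bin at this edge.
            merged_counts.append(running)
            merged_edges.append(right_edge)
            running = 0
            closed = True

    if not closed:
        # Sparse tail left over: fold it into the last merged bin,
        # or keep it as the only bin if nothing was closed before.
        if merged_counts:
            merged_counts[-1] += running
            merged_edges[-1] = edges[-1]
        else:
            merged_counts.append(running)
            merged_edges.append(edges[-1])

    return merged_counts, merged_edges
```

With the data from the question, the output of np.histogram(x, bins=bins) (counts first, then edges) could be passed in directly, and the returned edges reused with np.digitize or pandas.cut. Because the sweep only goes left to right, a sparse bin is always merged with its right-hand neighbours (except for the tail), so the result is not symmetric.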
Upvotes: 1