hangc

Reputation: 5473

Pandas Merge Bins

I have created a distribution using numpy histogram and digitize functions.

import numpy as np

# Compute the bin edges, then replace each value with the left edge of its bin.
_, bins = np.histogram(x, bins=bins)
arr = np.digitize(x, bins) - 1
x = bins[arr]

Or possibly:

x = pandas.cut(x, bins=bins)

However, because the distribution is very skewed, even after removing outliers, many bins contain very few observations. I want to merge bins, somewhat similar to:

How to merge bins in R

The procedure would possibly involve a pandas groupby, followed by merging every group with fewer than n observations into its neighbouring group. Is there a way to achieve this in pandas/numpy?

Upvotes: 4

Views: 1160

Answers (1)

honza_p

Reputation: 2093

As promised, I implemented something in physt, version 0.3.5. You're welcome to use it.

See http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#Merging-bins and particularly http://nbviewer.jupyter.org/github/janpipek/physt/blob/master/doc/Binning2.ipynb#By-min-frequency

In your case, the workflow would be something like this:

import physt
histogram = physt.h1(x, bins=bins)
histogram.merge_bins(min_frequency=n)
bins = histogram.numpy_bins 

Note that the code is in alpha stage, and not every bin is guaranteed to contain at least the required minimum (this is deliberate, in order to preserve tall, narrow bins). I am still looking for the best algorithm.
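If you would rather not add a dependency, the merging idea can also be sketched in plain numpy. This is not the algorithm physt uses: it is a simple greedy left-to-right pass, and the function name merge_sparse_bins and its parameters are made up for illustration. It assumes the edges are sorted, that there are at least two of them, and that min_count is at least 1.

```python
import numpy as np

def merge_sparse_bins(x, edges, min_count):
    """Greedy sketch: drop interior edges until each merged bin holds
    at least min_count observations (hypothetical helper, not physt)."""
    counts, edges = np.histogram(x, bins=edges)
    new_edges = [edges[0]]
    running = 0
    # Walk the bins left to right, closing a merged bin once it is full enough.
    for count, right_edge in zip(counts, edges[1:]):
        running += count
        if running >= min_count:
            new_edges.append(right_edge)
            running = 0
    # Trailing bins that never reached min_count are folded into the last
    # merged bin (or form the only bin if nothing was closed at all).
    if new_edges[-1] != edges[-1]:
        if len(new_edges) > 1:
            new_edges[-1] = edges[-1]
        else:
            new_edges.append(edges[-1])
    return np.asarray(new_edges)

# Example with made-up data: the two middle bins hold one value each.
x = np.array([0.1, 0.2, 0.3, 0.4, 1.5, 2.5, 3.1, 3.2, 3.3])
merged = merge_sparse_bins(x, [0, 1, 2, 3, 4], min_count=2)
merged_counts, _ = np.histogram(x, bins=merged)
```

The returned edges can be passed back to np.digitize or pandas.cut in place of the original bins. Because the pass is greedy, a sparse bin is always merged to the right (except at the tail), so it will not necessarily pick the "nearest" or best neighbour, and the only bin that can still end up below min_count is a single bin covering all the data.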

Upvotes: 1
