iblasi

Reputation: 1307

Math percentile on histogram bins. Edge error

I have a set of points in X and Y. I want to create small-range bins on X, calculate a percentile for each bin, and then fit a polynomial regression through all the bins to get a continuous percentile curve. The problem is at the edges: the edge bins contain fewer points, and because of this the percentile values there are distorted.

The following image and code show how everything is done. As you can see, the percentile values are calculated for a maximum defined at 99.8 and a minimum defined at 4.0:

(image: random data with upper/lower percentile points and polynomial fits)

import numpy as np
import matplotlib.pyplot as plt

###############################
degree = 8
step = 0.05
numPercUp = 99.8
numPercDown = 4.0

###############################
fig = plt.figure(figsize=(8, 6))

dataX = np.random.uniform(low=0.14, high=2.06, size=(1000))
dataY = np.random.uniform(low=50, high=550, size=(1000))
plt.scatter(dataX, dataY, c='b', s=5, marker="+", label="data")

xMin = np.min(dataX)
xMax = np.max(dataX)
print('xMin: ', xMin)
print('xMax: ', xMax)

xMin = (int(xMin / step)+1) * step
xMax = (int(xMax / step)+1) * step
print('xMin: ', xMin)
print('xMax: ', xMax)

bins = np.arange(xMin, xMax, step)
inds = np.digitize(dataX, bins) # http://stackoverflow.com/questions/2275924/how-to-get-data-in-a-histogram-bin
print('bins: ', bins, bins[0], bins[-1], len(bins))
print('inds: ', np.min(inds), np.max(inds), np.sum(inds == 0))


# Percentile coordinates
percX    = np.arange(xMin, xMax+step, step) - step/2    # All bin X position centered on the bin
percUp   = np.zeros(len(bins)+1)
percDown = np.zeros(len(bins)+1)

for i in range(len(bins)+1):
    dataBin = dataY[inds == i]
    percUp[i]   = np.percentile(dataBin, numPercUp)
    percDown[i] = np.percentile(dataBin, numPercDown)

print('percX: ', percX)
print('percUp: ', percUp)
plt.plot(percX, percUp, color='green', linestyle='-', linewidth=2, marker="o", markerfacecolor='red', markersize=5, label="Up perc.")
plt.plot(percX, percDown, color='green', linestyle='-', linewidth=2, marker="o", markerfacecolor='red', markersize=5, label="Down perc.")


# Polynomial Regression
z = np.polyfit(percX, percUp, degree)
f = np.poly1d(z)
x_new = np.linspace(0.1, 2.1, 50)
y_new = f(x_new)
plt.plot(x_new, y_new, 'r--')

z = np.polyfit(percX, percDown, degree)
f = np.poly1d(z)
x_new = np.linspace(0.1, 2.1, 50)
y_new = f(x_new)
plt.plot(x_new, y_new, 'r--')


# Frame specification
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Perc. Min/Max')
plt.grid()
plt.legend()
plt.axis([0, 2.2, 0, 600])
plt.xticks(np.arange(0, 2.21, 0.1))
plt.yticks(np.arange(0, 600+1, 50))
plt.show()

I see 3 possibilities, but none of them fully convinces me:

  1. Make the bins bigger than 0.05, but this reduces accuracy.
  2. Reduce the degree of the polynomial fit, but on the real data (see the final image) this also reduces accuracy and does not fit as expected.
  3. Delete the edge values before fitting the polynomial regression (maybe the most convincing option for me).
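Option 3 can be sketched like this. This is a hedged example with made-up percentile arrays (the real ones would come from the binning code above); `trim`, the number of edge bins dropped per side, is a hypothetical parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
percX = np.linspace(0.175, 2.075, 39)          # hypothetical bin centers
percUp = 540 + rng.normal(0, 5, percX.size)    # hypothetical upper-percentile values

trim = 2                                       # hypothetical: edge bins to drop per side
z = np.polyfit(percX[trim:-trim], percUp[trim:-trim], 8)
f = np.poly1d(z)
y_fit = f(percX)                               # evaluate on the full range afterwards
```

The fit still extrapolates over the trimmed edges, so it can diverge there; the trimming only removes the distorted edge percentiles from the fitting itself.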

The following image shows a real dataset example. As you can see, a low-degree polynomial fit cannot be used, and making the bins bigger will not improve things much on the right side (high x).

(image: real dataset with percentile bins and polynomial fits)

What could be a good solution? Maybe abandon the continuous fit, apply a Gaussian filter to the per-bin percentiles, and then use nearest-neighbour interpolation on X to define a smooth regression line?
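That Gaussian-filter idea can be sketched like this. This is an assumption-laden example using `scipy.ndimage.gaussian_filter1d` on made-up per-bin percentiles; `mode='nearest'` replicates the edge values during smoothing, so the curve is not dragged toward zero at the first and last bins:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(1)
percX = np.linspace(0.175, 2.075, 39)           # hypothetical bin centers
percUp = 540 + rng.normal(0, 8, percX.size)     # noisy hypothetical percentiles

# smooth the per-bin percentiles; 'nearest' avoids zero-padding at the edges
smooth = gaussian_filter1d(percUp, sigma=2, mode='nearest')

# interpolate between bin centers for a continuous curve
x_new = np.linspace(0.175, 2.075, 200)
y_new = np.interp(x_new, percX, smooth)
```

The choice of `sigma` plays the same accuracy/smoothness role as the polynomial degree does in the fit above.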

Upvotes: 2

Views: 1749

Answers (1)

Sam Mason

Reputation: 16184

Transform your Xs so you give more importance to the values you care about; for example, taking -log(2.1-X) gives a basically linear response near 0 and an exponentially increasing one nearer 2. Using this function to determine the bins will give better estimates of the values near two.
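To see the effect, here's a small sketch using the -log(2.1 - X) example from above: edges that are equally spaced in the transformed space map back to progressively narrower bins near X = 2, so more bins cover the region where the edge distortion was worst.

```python
import numpy as np

# six equally spaced edges in the transformed space ...
edges_t = np.linspace(-np.log(2.1 - 0.0), -np.log(2.1 - 2.0), 6)
# ... map back to original-space edges of shrinking width
edges_x = 2.1 - np.exp(-edges_t)
widths = np.diff(edges_x)
print(widths)   # each bin is narrower than the last as X approaches 2
```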

Let's start by generating some dummy data:

X = np.linspace(0,2,50000)
Y = np.random.gamma(4, 0.1+(X**2)*(2-X)/(0.01+(2-X)))
plt.plot(X,Y,'.')
plt.margins(0.04)

generated data

Define a function to transform the Xs, and its inverse:

def xfrm(X): return -np.log(2.05-np.array(X))
def ivrt(Y): return 2.05-np.exp(-np.array(Y))
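A quick round-trip check (a minimal sanity test, not part of the original answer) confirms that the pair inverts correctly over the data range:

```python
import numpy as np

def xfrm(X): return -np.log(2.05 - np.array(X))
def ivrt(Y): return 2.05 - np.exp(-np.array(Y))

x = np.linspace(0, 2, 9)
roundtrip = ivrt(xfrm(x))
```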

We can then get the histogram counts:

Xi = xfrm(X)
bins = np.linspace(np.min(Xi),np.max(Xi)+1e-5,201)
ii = np.digitize(Xi,bins)
pcts = np.array([np.percentile(Y[ii==i],[4,95]) for i in range(1,len(bins))])

and generate some plots to make sure it's behaving as expected:

fig,axs = plt.subplots(2,figsize=(8,10))

mids = bins[1:] - np.diff(bins)/2
axs[0].plot(X,Y,'.',zorder=1)
axs[0].vlines(ivrt(mids),pcts[:,0],pcts[:,1],lw=1);
axs[0].margins(0.04)
axs[1].plot(Xi,Y,'.',zorder=1)
axs[1].vlines(mids,pcts[:,0],pcts[:,1],lw=1);
axs[1].margins(0.04)

f = np.poly1d(np.polyfit(mids, pcts[:,1], 8))
axs[0].plot(ivrt(mids), f(mids),lw=3)
axs[1].plot(mids, f(mids))

f = np.poly1d(np.polyfit(mids, pcts[:,0], 8))
axs[0].plot(ivrt(mids), f(mids),lw=3)
axs[1].plot(mids, f(mids));

The upper plot is of the original values and the bottom is of the transformed values. The vertical lines show the values used to generate the fits.

fit

I think I may have got a bit carried away with this one, but hope it's interesting!

Upvotes: 4
