Reputation: 534
I have some data that I bin in x and y. I then normalize the data in the x bins such that the data in each x bin sums to 1, so I have a normalized probability for each value of y at each x.
nA, binsx, binsy = np.histogram2d(dataA, dataB,
                                  bins=[binsA, binsB], normed=False)
H = np.ma.masked_where(nA == 0.0, nA)
for i in range(len(H[0, :])):       # column index i, over len of row 0
    colTot = np.sum(H[:, i])
    for j in range(len(H[:, 0])):   # row index j, over len of column 0
        H[j, i] = H[j, i] / colTot
At this point H is normalized along columns... each sums to 1.
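As an aside, the double loop can be collapsed into a single broadcast division. A minimal sketch with a made-up count matrix (the `nA` below is hypothetical, standing in for the histogram counts):

```python
import numpy as np

# Hypothetical small count matrix (4 y-bins x 3 x-bins) standing in for nA.
nA = np.array([[2., 0., 1.],
               [3., 4., 0.],
               [1., 4., 2.],
               [2., 0., 1.]])

# Mask the zeros as in the snippet above, then normalize every column
# in one step: dividing by the column sums broadcasts across rows.
H = np.ma.masked_where(nA == 0.0, nA)
H = H / H.sum(axis=0)

print(H.sum(axis=0))  # each column now sums to 1
```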
My question is, how can I efficiently compute the median value in each column? I believe I need to build, for each column (i.e. each set of values in an x bin), a new array that repeats each y value by its original (nA) count for that bin. Seems convoluted... is there an easier way?
Here's what I'm trying now:
nA, binsx, binsy = np.histogram2d(dataA, dataB,
                                  bins=[binsA, binsB], normed=False)
for j in range(nA[0, :].size):      # loop over number of columns
    oneMass = np.array([])
    for i in range(nA[:, 0].size):  # loop over rows in y...
        tmp = np.repeat(binsA[i], np.int32(nA[i, j]))
        if tmp.size > 0:
            oneMass = np.concatenate((oneMass, tmp))
    print('Median', np.median(oneMass))
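The inner loop can be dropped entirely: `np.repeat` accepts an array of repeat counts, so one call per column builds the expanded sample. A sketch with hypothetical counts and bin values (here the left bin edges are used as the y values, as in the snippet above):

```python
import numpy as np

# Hypothetical counts (4 y-bins x 3 x-bins) and bin edges (len = rows + 1).
nA = np.array([[2, 0, 1],
               [3, 4, 0],
               [1, 4, 2],
               [2, 0, 1]])
binsA = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

medians = []
for j in range(nA.shape[1]):
    # One np.repeat call per column replaces the inner loop:
    # each bin value is repeated by its count in this column.
    oneMass = np.repeat(binsA[:-1], nA[:, j])
    medians.append(np.median(oneMass))
print(medians)  # [1.0, 1.5, 2.0]
```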
Upvotes: 1
Views: 694
Reputation: 14399
If you've already normalized the columns, you can just do a linear interpolation to 0.5 over the cumulative probability function:
cumCols = np.cumsum(H, axis=0)      # cumulate down each column (they sum to 1)
# np.interp(x, xp, fp): find where the cumulative probability crosses 0.5;
# the cumulative value at row i corresponds to the upper bin edge binsA[i + 1]
medians = np.array([np.interp(0.5, cumCols[:, j], binsA[1:])
                    for j in range(H.shape[1])])
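A minimal check of the idea on one column, using hypothetical counts and bin edges (note this gives a continuous, interpolated median rather than a median of repeated bin values):

```python
import numpy as np

# Hypothetical counts in one x-column and its bin edges (len = counts + 1).
counts = np.array([2., 3., 1., 2.])
binsA = np.array([0., 1., 2., 3., 4.])

col = counts / counts.sum()           # normalized column (sums to 1)
cum = np.cumsum(col)                  # CDF value at each upper bin edge
median = np.interp(0.5, cum, binsA[1:])  # x where the CDF crosses 0.5

print(median)  # ≈ 1.667
```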
Upvotes: 2