Reputation: 141
I have a set of data, and a set of thresholds for creating bins:
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0,5,10])
bins = np.digitize(data, thresholds, right=True)
For each of the elements in bins
, I want to know the base percentile. For example, in bins
, the smallest bin should start at the 0th percentile. Then the next bin, for example, the 20th percentile. So that if a value in data
falls between the 0th and 20th percentile of data
, it belongs in the first bin
.
I've looked into pandas rank(pct=True)
but can't seem to get this done correctly.
Suggestions?
Upvotes: 2
Views: 5366
Reputation: 691
You can calculate the percentile for each element in your data array as described in a previous StackOverflow question (Map each list value to its corresponding percentile).
import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
Method 1: Using scipy.stats.percentileofscore :
data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Method 2: Using scipy.stats.rankdata and normalising to 100 (faster) :
ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Now that you have a list of percentiles, you can bin them as before using numpy.digitize :
bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
This gives you the data binned according to the indices of your chosen list of percentiles. If desired, you could also return the actual (upper) percentiles using numpy.take :
data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20, 20, 40, 40, 40, 60, 60, 100, 100, 80, 100])
Upvotes: 8