Reputation: 1185
I'm not sure if my problem is solvable, but thought I'd try; a search gave no result, at any rate.
The task: I have a large-ish dataset of approximately 40k elements. These are rated in terms of familiarity by raters (i.e. if an item has a rating of 0.75, this means 75% of raters were familiar with it). I want to divide this data into 4 equally sized bins. The natural way to do this is with the pandas 'quantile' function to get interquartile ranges.
The problem: 53% of my data is known to 100% of my participants, which means two of my quantiles have the same value. As a result, feeding the results of the quantile function into my code produces an empty bin for one of the quantiles, because the first bin takes all the values (see code below).
Does anyone know of a way to split my data into four even groups, even if all the data in two of the groups has the same value? I'd like to re-use this code, so putting in a kludge like specifying a specific index range to pick out a quarter of the data would make it too specific to this dataset.
Many thanks!
import pandas as pd

data3 = pd.read_csv('filepath.csv')

######### Empty lists to take variables
well = []     # Well-known elements
medwell = []  # Medium-well-known elements
med = []      # Medium-known elements
low = []      # Rarely known elements

############# Binning of data by familiarity
# Compute each quantile once, instead of on every loop iteration
q25 = data3['Percent_known'].quantile(0.25)
q50 = data3['Percent_known'].quantile(0.50)
q75 = data3['Percent_known'].quantile(0.75)

for i in range(len(data3)):
    if data3['Percent_known'][i] >= q75:
        well.append(data3['Word'][i])  # Familiarity
    elif data3['Percent_known'][i] >= q50:
        medwell.append(data3['Word'][i])
    elif data3['Percent_known'][i] >= q25:
        med.append(data3['Word'][i])
    else:
        low.append(data3['Word'][i])
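A minimal sketch of why the bins collapse (toy data, not the asker's file): when more than half the values are tied at 1.0, both the 0.50 and 0.75 quantiles equal 1.0, so the first `>=` branch above captures every fully-known item and the next bin stays empty.

```python
import pandas as pd

# Toy data mimicking the problem: over half the values equal 1.0
s = pd.Series([1.0] * 6 + [0.9, 0.7, 0.5, 0.3])

# Both upper quantiles land on the tied value, so thresholding
# on them cannot separate the top two bins.
print(s.quantile(0.75))  # 1.0
print(s.quantile(0.50))  # 1.0
```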
Upvotes: 2
Views: 1146
Reputation: 16251
I would add a small, random jitter to the Percent_known column. In this way you will be able to (randomly) sort all the items known 100% into quantiles.
import numpy as np
import pandas as pd

# create toy data
df = pd.DataFrame([1, 1, 1, 1, 0.5, 0.5, 0, 0], columns=['known'])

# add a small uniform jitter (within ±0.005) to break the ties at 1.0
df['fudge'] = df['known'] + 0.01 * (np.random.rand(len(df)) - 0.5)

df['known'][df['fudge'] > df['fudge'].quantile(0.75)]
The last line will randomly select a quarter of the items from among those known by 100% of raters.
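Once the ties are broken, all four bins can also be assigned in a single call with pd.qcut instead of per-quantile comparisons. A sketch (the bin labels are hypothetical, chosen to match the asker's list names):

```python
import numpy as np
import pandas as pd

# Jitter as above, with a seeded generator for reproducibility
rng = np.random.default_rng(0)
df = pd.DataFrame({'known': [1, 1, 1, 1, 0.5, 0.5, 0, 0]})
df['fudge'] = df['known'] + 0.01 * (rng.random(len(df)) - 0.5)

# qcut splits the now-distinct jittered values into 4 equal-sized bins
df['bin'] = pd.qcut(df['fudge'], 4, labels=['low', 'med', 'medwell', 'well'])
print(df['bin'].value_counts())  # two items per bin
```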
Additionally, it would be much more efficient to calculate quantiles in a vectorized fashion rather than with a loop. For instance:
df['quant'] = np.nan
for q in [0.75, 0.5, 0.25]:
    df.loc[(df['fudge'] <= df['fudge'].quantile(q + 0.25)) &
           (df['fudge'] > df['fudge'].quantile(q)), 'quant'] = q
df['quant'] = df['quant'].fillna(0.0)
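An alternative sketch, not part of the jitter approach above: ties can also be broken deterministically by ranking with method='first', which gives tied values distinct consecutive ranks, and then cutting the ranks. This avoids randomness entirely.

```python
import pandas as pd

df = pd.DataFrame({'known': [1, 1, 1, 1, 0.5, 0.5, 0, 0]})

# method='first' assigns tied values distinct ranks in order of
# appearance, so qcut can always split them into equal bins
ranks = df['known'].rank(method='first')
df['quartile'] = pd.qcut(ranks, 4, labels=[0.0, 0.25, 0.5, 0.75])
print(df['quartile'].value_counts())  # two items per quartile
```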
Upvotes: 3