Reputation: 1185
I'm not sure if my problem is solvable, but thought I'd try; a search gave no result, at any rate.
The task: I have a large-ish dataset of approximately 40k elements. These are rated in terms of familiarity by raters (i.e. if an item has a rating of 0.75, this means 75% of raters were familiar with it). I want to divide this data into 4 equally sized bins. The natural way to do this is with the pandas 'quantile' function to get interquartile ranges.
The problem: 53% of my data is known to 100% of my participants, which means two of my quantiles have the same value. As a result, feeding the results of the quantile function into my code produces an empty bin for one of the quantiles, because the first bin takes all the values (see code below).
Does anyone know of a way to split my data into four even groups, even if all the data in two of the groups has the same value? I'd like to re-use this code, so putting in a kludge like specifying a specific index range to pick out a quarter of the data would make it too specific to this dataset.
Many thanks!
import pandas as pd

data3 = pd.read_csv('filepath.csv')

######### Empty lists to take variables
well = []     # Well-known elements
medwell = []  # Medium-well-known elements
med = []      # Medium-known elements
low = []      # Rarely known elements

############# Binning of data by familiarity
# Compute each quantile once, instead of on every loop iteration
q25 = data3['Percent_known'].quantile(0.25)
q50 = data3['Percent_known'].quantile(0.50)
q75 = data3['Percent_known'].quantile(0.75)

for i in range(len(data3)):
    if data3['Percent_known'][i] >= q75:
        well.append(data3['Word'][i])  # Familiarity
    elif data3['Percent_known'][i] >= q50:
        medwell.append(data3['Word'][i])
    elif data3['Percent_known'][i] >= q25:
        med.append(data3['Word'][i])
    else:
        low.append(data3['Word'][i])
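A minimal sketch of why the bins collapse (toy data, not the asker's file): when more than half the values are tied at 1.0, both the 0.50 and 0.75 quantiles equal 1.0, so the first `>=` branch above captures every fully-known item and the next bin stays empty.

```python
import pandas as pd

# Toy data mimicking the problem: over half the values equal 1.0
s = pd.Series([1.0] * 6 + [0.9, 0.7, 0.5, 0.3])

# Both upper quantiles land on the tied value, so thresholding
# on them cannot separate the top two bins.
print(s.quantile(0.75))  # 1.0
print(s.quantile(0.50))  # 1.0
```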
Upvotes: 2
Views: 1146
Reputation: 16251
I would add a small, random jitter to the Percent_known column. In this way you will be able to (randomly) sort all the items known 100% into quantiles.
import numpy as np
import pandas as pd

# create toy data
df = pd.DataFrame([1, 1, 1, 1, 0.5, 0.5, 0, 0], columns=['known'])

# add a small uniform jitter (within ±0.005) to break the ties at 1.0
df['fudge'] = df['known'] + 0.01 * (np.random.rand(len(df)) - 0.5)

df['known'][df['fudge'] > df['fudge'].quantile(0.75)]
The last line will randomly select a quarter of the items from among those known by 100% of raters.
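Once the ties are broken, all four bins can also be assigned in a single call with pd.qcut instead of per-quantile comparisons. A sketch (the bin labels are hypothetical, chosen to match the asker's list names):

```python
import numpy as np
import pandas as pd

# Jitter as above, with a seeded generator for reproducibility
rng = np.random.default_rng(0)
df = pd.DataFrame({'known': [1, 1, 1, 1, 0.5, 0.5, 0, 0]})
df['fudge'] = df['known'] + 0.01 * (rng.random(len(df)) - 0.5)

# qcut splits the now-distinct jittered values into 4 equal-sized bins
df['bin'] = pd.qcut(df['fudge'], 4, labels=['low', 'med', 'medwell', 'well'])
print(df['bin'].value_counts())  # two items per bin
```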
Additionally, it would be much more efficient to calculate quantiles in a vectorized fashion rather than with a loop. For instance:
df['quant'] = np.nan
for q in [0.75, 0.5, 0.25]:
    df.loc[(df['fudge'] <= df['fudge'].quantile(q + 0.25)) &
           (df['fudge'] > df['fudge'].quantile(q)), 'quant'] = q
df['quant'] = df['quant'].fillna(0.0)
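An alternative sketch, not part of the jitter approach above: ties can also be broken deterministically by ranking with method='first', which gives tied values distinct consecutive ranks, and then cutting the ranks. This avoids randomness entirely.

```python
import pandas as pd

df = pd.DataFrame({'known': [1, 1, 1, 1, 0.5, 0.5, 0, 0]})

# method='first' assigns tied values distinct ranks in order of
# appearance, so qcut can always split them into equal bins
ranks = df['known'].rank(method='first')
df['quartile'] = pd.qcut(ranks, 4, labels=[0.0, 0.25, 0.5, 0.75])
print(df['quartile'].value_counts())  # two items per quartile
```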
Upvotes: 3