Reputation: 134
This is my dataframe with interval numbers (classes) in it.
import pandas as pd

df = pd.DataFrame({'Class': [1,2,3,4,5,6,7,8,9,10,11],
                   'Class Interval': ['16.25-18.75', '18.75-21.25', '21.25-23.75',
                                      '23.75-26.25', '26.25-28.75', '28.75-31.25',
                                      '31.25-33.75', '33.75-36.25', '36.25-38.75',
                                      '38.75-41.25', '41.25-43.75'],
                   '𝑓𝑖': [2,7,7,14,17,24,11,11,3,3,1],
                   'Cumulative 𝑓𝑖': [2,9,16,30,47,71,82,93,96,99,100],
                   '𝑓𝑖/n': [.02,.07,.07,.14,.17,.24,.11,.11,.03,.03,.01],
                   'Cumulative 𝑓𝑖/n': [.02,.09,.16,.30,.47,.71,.82,.93,.96,.99,1.00]})
df
    Class Class Interval  𝑓𝑖  Cumulative 𝑓𝑖  𝑓𝑖/n  Cumulative 𝑓𝑖/n
0       1    16.25-18.75   2              2  0.02             0.02
1       2    18.75-21.25   7              9  0.07             0.09
2       3    21.25-23.75   7             16  0.07             0.16
3       4    23.75-26.25  14             30  0.14             0.30
4       5    26.25-28.75  17             47  0.17             0.47
5       6    28.75-31.25  24             71  0.24             0.71
6       7    31.25-33.75  11             82  0.11             0.82
7       8    33.75-36.25  11             93  0.11             0.93
8       9    36.25-38.75   3             96  0.03             0.96
9      10    38.75-41.25   3             99  0.03             0.99
10     11    41.25-43.75   1            100  0.01             1.00
Question: How can I calculate the grouped median for this dataframe in Python?
Worked out by hand, the result is 29.06.
I've tried 'median_grouped':
# importing median_grouped from the statistics module
from statistics import median_grouped
# printing median_grouped for the set
print("Grouped Median is %s" %(median_grouped(df['Class Interval'])))
But I got the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-26-491000133032> in <module>
4
5 # printing median_grouped for the set
----> 6 print("Grouped Median is %s" %(median_grouped(df['Class Interval'])))
~\Anaconda3\ANACONDA\lib\statistics.py in median_grouped(data, interval)
463 for obj in (x, interval):
464 if isinstance(obj, (str, bytes)):
--> 465 raise TypeError('expected number but got %r' % obj)
466 try:
467 L = x - interval/2 # The lower limit of the median interval.
TypeError: expected number but got '28.75-31.25'
Then I tried making two columns (one with the lower limit and one with the upper limit), but that only gave me the median of the lower limits (28.75) or of the upper limits (31.25). I also tried using only the lower limit, but of course that returns 28.75 as well.
I do not have the values within the intervals, so I cannot rebuild a list of values to bin with pd.cut (I do not want to guess). I also tried manually turning the class intervals into bins (for example, 16.25-18.75 becomes (16.25, 18.75]), but then I got the error message: TypeError: unorderable types: Interval() < float()
Is it possible to make the interval column numeric instead of a string, so that the grouped median can be calculated automatically in Python?
Upvotes: 3
Views: 940
Reputation: 7591
You can recreate a list of artificial data points that carries the same statistical information (the middle value of each interval, repeated 𝑓𝑖 times) and run the median_grouped function on it:
from statistics import median_grouped

# Lower bound, upper bound, and midpoint of each interval
bounds = df['Class Interval'].str.split('-', expand=True).astype(float)
df['lower'] = bounds[0]
df['upper'] = bounds[1]
df['middle'] = (df['lower'] + df['upper']) / 2
# Build an artificial list of values: each midpoint repeated 𝑓𝑖 times
flat_list = []
for index, row in df.iterrows():
    flat_list.extend([row['middle']] * int(row['𝑓𝑖']))
# Compute the grouped median with statistics.median_grouped
median_grouped(flat_list, interval=2.5)  # interval must equal the class width!
# => 29.0625
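To see why this works: median_grouped treats the middle element of the sorted data as the midpoint of the median class and interpolates inside an interval of the given width around it. Below is a minimal, pandas-free sketch of the same idea. The variable names are mine, and the bounds and frequencies are simply retyped from the question's table.

```python
from statistics import median_grouped

width = 2.5  # class width, taken from the question's table
lower = [16.25 + width * i for i in range(11)]   # 16.25, 18.75, ..., 41.25
freq = [2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1]    # the 𝑓𝑖 column

# Repeat each class midpoint as many times as its frequency
midpoints = [lo + width / 2 for lo in lower]
data = [m for m, f in zip(midpoints, freq) for _ in range(f)]

# The interval argument must match the class width, otherwise the
# interpolation happens inside the wrong range
result = median_grouped(data, interval=width)
print(result)
```

Note that this only makes sense when all classes have the same width, because median_grouped takes a single interval value.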
Upvotes: 1
Reputation: 59274
I'd first convert your intervals to two separate columns for the lower bound (lb) and the upper bound (ub):
df = (df.join(df['Class Interval'].str.split('-', expand=True)
                                  .apply(pd.to_numeric)
                                  .rename(columns={0: 'lb', 1: 'ub'}))
        .drop(columns='Class Interval'))
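Then you can write out the grouped-median formula directly. Here m is the index of the median class, i.e. the first class whose cumulative frequency reaches N/2:
m = (df['Cumulative 𝑓𝑖'] >= df['𝑓𝑖'].sum()/2).idxmax()
gmedian = df.loc[m, 'lb'] + ((df['𝑓𝑖'].sum()/2 - df.loc[m - 1, 'Cumulative 𝑓𝑖'])/(df.loc[m, '𝑓𝑖']))*(df['ub'] - df['lb']).loc[m]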
Or, in a more didactic way,
L = df.loc[m, 'lb']
N = df['𝑓𝑖'].sum()
F = df.loc[m - 1, 'Cumulative 𝑓𝑖']
f = df.loc[m, '𝑓𝑖']
C = (df['ub'] - df['lb']).loc[m]
gmedian = L + ((N/2 - F)/(f))*C
Outputs
29.0625
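If you want this as a reusable helper, the same formula L + ((N/2 - F)/f)*C can be wrapped in a function that locates the median class from the cumulative frequencies. This is only a sketch: the name grouped_median and its signature are my own (it is not a pandas or statistics function), and it assumes the classes are sorted in ascending order and do not overlap.

```python
from itertools import accumulate

def grouped_median(lower, upper, freq):
    """Grouped median L + ((N/2 - F) / f) * C for binned data."""
    lower, upper, freq = list(lower), list(upper), list(freq)
    cum = list(accumulate(freq))      # cumulative frequencies
    n = cum[-1]                       # N, the total number of observations
    # Median class: first class whose cumulative frequency reaches N/2
    m = next(i for i, c in enumerate(cum) if c >= n / 2)
    F = cum[m - 1] if m > 0 else 0    # cumulative frequency before the median class
    C = upper[m] - lower[m]           # width of the median class
    return lower[m] + (n / 2 - F) / freq[m] * C

# Values retyped from the question's table
lb = [16.25 + 2.5 * i for i in range(11)]
ub = [b + 2.5 for b in lb]
fi = [2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1]
result = grouped_median(lb, ub, fi)
print(result)
```

Since the function only iterates over its arguments, it should also accept the dataframe columns directly, e.g. grouped_median(df['lb'], df['ub'], df['𝑓𝑖']).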
Upvotes: 2