Reputation: 17
I am exploring different python libraries and I wondering how to find approximate median value of the grouped data set. Here is a table for a reference.
Age | frequency |
---|---|
1-10 | 123 |
10-20 | 350 |
20-30 | 200 |
30-40 | 1700 |
40-50 | 360 |
50-60 | 60 |
Is there any function or do I need to hard code the formula and then have a loop going over each row and column of the dataset ?
Thanks in advance.
Upvotes: 1
Views: 1472
Reputation: 3
The solution of Zia Beheshti works very well, in cases where the frequency in table is not zero. freq_median should be checked at the beginning:
freq_median = df['frequency'][index + 1]
while (freq_median==0):
index+=1
freq_median = df['frequency'][index + 1]
It will prevent finding median as inf
.
Upvotes: 0
Reputation: 26
If you want to approximate median for discrete grouped data, you can approximate the median of the entire data set by interpolation using the formula:
median = L + interval * (N / 2 - CF) / F
L = lower limit of the median interval
N = total number of data points
CF = number of data points below the median interval
F = number of data points in the median interval
# Approximating median by pure python and pandas functions
import pandas as pd
df = pd.DataFrame.from_dict({'low_range':[1,11,21,31,41,51], 'high_range':[10,20,30,40,50,60], 'frequency':[123,350,200,1700,360,60]})
N = df['frequency'].sum()
# calulating L1
index = abs(df['frequency'].cumsum() - N/2).idxmin()
L1 = df['low_range'][index + 1]
cumsum_before = df['frequency'].cumsum()[index]
freq_medain = df['frequency'][index + 1]
width = df['high_range'][index + 1] - df['low_range'][index + 1] + 1
median = L1 + (N/2 - cumsum_before ) / freq_medain * width
print("L1 = {} , cumsum_before = {}, freq_medain = {}, width = {}".format(L1, cumsum_before, freq_medain, width ) )
print("Approximated median = ", median)
L1 = 31 , cumsum_before = 673, freq_medain = 1700, width = 10
Approximated median = 35.25588235294118
If you have continuous data, you can use median_grouped function in statistics package.
# Approximating median by statistics grouped_median for continuous values and fixed intervals
import statistics as st
import pandas as pd
df = pd.DataFrame.from_dict({'low_range':[1,10,21,31,41,51], 'high_range':[10,21,31,41,51,60], 'frequency':[123,350,200,1700,360,60]})
X = ((df['low_range'] + df['high_range'])/2).tolist()
f = df['frequency'].tolist()
# repeating values based on their frequencies
Y = [item for i,item in enumerate(X)
for count in range(f[i])]
width = df['high_range'][0] - df['low_range'][0] + 1
median = st.median_grouped(Y, width)
print("Approximated median = ", median)
Approximated median = 35.25588235294118
Upvotes: 1