Reputation:
I have the following data:
[4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
I need to build a count/frequency table from the data above, like this:
4.1 - 4.5: 8
4.6 - 5.0: 4
5.1 - 5.5: 10
5.6 - 6.0: 6
6.1 - 6.5: 7
6.6 - 7.0: 5
The closest I can get is the following result:
            counts  freqs
categories
[4.1, 4.6)       8  0.200
[4.6, 5.1)       4  0.100
[5.1, 5.6)      10  0.250
[5.6, 6.1)       6  0.150
[6.1, 6.6)       7  0.175
[6.6, 7.1)       5  0.125
Through this code:
import pandas as pd

sr = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8]
# right=False makes each bin closed on the left and open on the right, e.g. [4.1, 4.6)
ncut = pd.cut(sr, [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1], right=False)
srpd = pd.DataFrame(ncut.describe())
I need to create a new column holding the midpoint of each "categories" interval. For example, "[4.1, 4.6)" holds the count/frequency of the data from 4.1 to 4.5 (not including 4.6), so I need (4.1 + 4.5) / 2, which is equal to 4.3.
Here are my questions:
1) How do I access the values under the "categories" index so I can use them in a computation like the one above?
2) Is there a way to display the ranges like this: 4.1 - 4.5, 4.6 - 5.0, etc.?
3) Is there an easier way to compute the mean, median, mode, etc. for grouped data like this, or do I have to write my own functions for these in Python?
Thanks
Upvotes: 4
Views: 5832
Reputation: 153460
Let's try:
import pandas as pd

l = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
     5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
     5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
     6.7, 6.7, 6.8, 6.8]
s = pd.Series(l)
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
# Python 3.6+ f-string; round() guards against floating-point noise in j - 0.1
labels = [f'{i}-{round(j - 0.1, 1)}' for i, j in zip(bins, bins[1:])]
# concat of two unnamed Series gives columns 0 (the bin label) and 1 (the value)
(pd.concat([pd.cut(s, bins=bins, labels=labels, right=False), s], axis=1)
   .groupby(0)[1]
   .agg(['mean', 'median', pd.Series.mode, 'std'])
   .rename_axis('categories')
   .reset_index())
Output:
  categories      mean  median        mode       std
0    4.1-4.5  4.250000    4.25         4.1  0.151186
1    4.6-5.0  4.725000    4.70         4.6  0.150000
2    5.1-5.5  5.280000    5.30         5.3  0.131656
3    5.6-6.0  5.700000    5.65         5.6  0.126491
4    6.1-6.5  6.314286    6.30         6.2  0.121499
5    6.6-7.0  6.720000    6.70  [6.7, 6.8]  0.083666
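For question 1, here is a minimal sketch of reading the interval bounds back out of the "categories" index and turning them into a midpoint column. It assumes the bins are built with pd.cut and right=False as in the question; the names step, table, lower, upper and midpoint are made up for illustration, and step = 0.1 is an assumption about the precision of the data.

```python
import pandas as pd

data = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
        5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
        5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
        6.7, 6.7, 6.8, 6.8]
bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
step = 0.1  # assumed measurement precision of the data

cut = pd.cut(pd.Series(data), bins=bins, right=False)

# Count per bin; sort_index() puts the bins back in interval order
table = cut.value_counts().sort_index().to_frame(name="counts")
table["freqs"] = table["counts"] / len(data)

# Each index entry is a pandas Interval with .left and .right attributes
table["lower"] = [iv.left for iv in table.index]
# The bin is open on the right, so the largest value it can hold is right - step
table["upper"] = [round(iv.right - step, 1) for iv in table.index]
table["midpoint"] = ((table["lower"] + table["upper"]) / 2).round(2)

print(table)
```

The lower and upper columns also cover question 2, since they can be joined into a "4.1 - 4.5" style label if needed.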
Upvotes: 1
Reputation: 18647
What about the following for your bins and labels issue:
import pandas as pd

bins = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]
# round() guards against floating-point noise in y - 0.1
labels = ['{}-{}'.format(x, round(y - 0.1, 1)) for x, y in zip(bins, bins[1:])]
Then, instead of keeping your values in a list, make them a Series:
sr = pd.Series([4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9, 5.1,
5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6, 5.6, 5.7,
5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6, 6.7, 6.7, 6.8, 6.8])
ncut = pd.cut(sr, bins=bins, labels=labels, right=False)
Define a lambda function to calculate the relative frequency, i.e. the bin count divided by the total number of observations:
freq = lambda x: len(x) / len(sr)
freq.__name__ = 'freq'
Finally, use concat, groupby and agg to get your summary statistics per bin:
pd.concat([ncut, sr], axis=1).groupby(0).agg(['size', 'std', 'mean', freq])
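On question 3, the statistics above are computed from the raw values inside each bin. If only the frequency table is available, the usual textbook grouped-data formulas can be applied by hand. The sketch below is plain Python and hard-codes the class limits and counts from the question; gap = 0.1 (the distance between one class's upper limit and the next class's lower limit) and width = 0.5 are assumptions read off that table, and all variable names are illustrative.

```python
from itertools import accumulate

# Frequency table from the question: class limits and counts
lower = [4.1, 4.6, 5.1, 5.6, 6.1, 6.6]
upper = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0]
counts = [8, 4, 10, 6, 7, 5]

n = sum(counts)
mids = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]

# Grouped mean: class midpoints weighted by class counts
grouped_mean = sum(f * x for f, x in zip(counts, mids)) / n

# Grouped median: interpolate inside the first class whose
# cumulative count reaches n / 2
gap = 0.1    # assumed gap between adjacent class limits
width = 0.5  # assumed class width
cum = list(accumulate(counts))
k = next(i for i, c in enumerate(cum) if c >= n / 2)
boundary = lower[k] - gap / 2          # lower class boundary, e.g. 5.05
below = cum[k - 1] if k > 0 else 0     # cumulative count before the median class
grouped_median = boundary + (n / 2 - below) / counts[k] * width

print(round(grouped_mean, 4), round(grouped_median, 4))
```

For the table in the question this works out to a grouped mean of 5.4875 and a grouped median of 5.45 (median class 5.1 - 5.5).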
Upvotes: 3
Reputation:
I kind of figured out a noob way to do this:
import pandas as pd

def buildFreqTable(data, width, numclass, pw):
    data.sort()
    minrange = []
    maxrange = []
    x_med = []
    count = []
    # Since data is already sorted, take the lowest value to jumpstart the creation of ranges
    f_data = data[0]
    for i in range(0, numclass):
        # minrange holds the minimum value for that row
        minrange.append(f_data)
        # maxrange holds the maximum value for that row
        maxrange.append(f_data + (width - pw))
        # Compute the midpoint of the range
        minmax_median = (minrange[i] + maxrange[i]) / 2
        x_med.append(minmax_median)
        # initialize count per numclass to 0, this will be incremented later
        count.append(0)
        f_data = f_data + width
    # Tally the frequencies (loop over numclass rather than a hardcoded 6)
    for x in data:
        for i in range(0, numclass):
            if minrange[i] <= x <= maxrange[i]:
                count[i] = count[i] + 1
    # Now, create the pandas dataframe for easier manipulation
    freqtable = pd.DataFrame()
    freqtable['minrange'] = minrange
    freqtable['maxrange'] = maxrange
    freqtable['x'] = x_med
    freqtable['count'] = count
    return freqtable

buildFreqTable(sr, 0.5, 6, 0.1)
It gives the following:
   minrange  maxrange    x  count
0       4.1       4.5  4.3      8
1       4.6       5.0  4.8      4
2       5.1       5.5  5.3     10
3       5.6       6.0  5.8      6
4       6.1       6.5  6.3      7
5       6.6       7.0  6.8      5
Though I am still curious whether there is an easier way to do this, or whether anyone could refactor my code to be more "pro-like". Thanks.
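One possible refactor, as a sketch rather than a definitive version: the two loops can be replaced with NumPy array operations. The function name build_freq_table is hypothetical, rounding the class limits to 10 decimals is an assumption made to keep floating-point noise out of the comparisons, and unlike the function above this version does not sort the input list in place.

```python
import numpy as np
import pandas as pd

def build_freq_table(data, width, numclass, pw):
    """Frequency table with inclusive class limits [minrange, maxrange]."""
    values = np.sort(np.asarray(data, dtype=float))
    # Lower limit of each class, starting from the smallest value
    minrange = np.round(values[0] + width * np.arange(numclass), 10)
    # Upper limit is one precision step (pw) short of the next lower limit
    maxrange = np.round(minrange + (width - pw), 10)
    # Compare every value against every class at once, then count per class
    inside = (values[:, None] >= minrange) & (values[:, None] <= maxrange)
    return pd.DataFrame({
        'minrange': minrange,
        'maxrange': maxrange,
        'x': np.round((minrange + maxrange) / 2, 10),
        'count': inside.sum(axis=0),
    })

sr = [4.1, 4.1, 4.1, 4.2, 4.3, 4.3, 4.4, 4.5, 4.6, 4.6, 4.8, 4.9,
      5.1, 5.1, 5.2, 5.2, 5.3, 5.3, 5.3, 5.4, 5.4, 5.5, 5.6, 5.6,
      5.6, 5.7, 5.8, 5.9, 6.2, 6.2, 6.2, 6.3, 6.4, 6.4, 6.5, 6.6,
      6.7, 6.7, 6.8, 6.8]
freqtable = build_freq_table(sr, 0.5, 6, 0.1)
print(freqtable)
```

The broadcasting comparison builds a len(data) by numclass boolean array, which is fine for small inputs like this one; for large data, pd.cut or np.histogram would likely be the better tool.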
Upvotes: 1