Reputation: 1
I have some data with multiple observations for a given Collector, Date, Sample, and Type where the observation values vary by ID.
import StringIO
import pandas as pd
data = """Collector,Date,Sample,Type,ID,Value
Emily,2014-06-20,201,HV,A,34
Emily,2014-06-20,201,HV,B,22
Emily,2014-06-20,201,HV,C,10
Emily,2014-06-20,201,HV,D,5
John,2014-06-22,221,HV,A,40
John,2014-06-22,221,HV,B,39
John,2014-06-22,221,HV,C,11
John,2014-06-22,221,HV,D,2
Emily,2014-06-23,203,HV,A,33
Emily,2014-06-23,203,HV,B,35
Emily,2014-06-23,203,HV,C,13
Emily,2014-06-23,203,HV,D,1
John,2014-07-01,218,HV,A,35
John,2014-07-01,218,HV,B,29
John,2014-07-01,218,HV,C,13
John,2014-07-01,218,HV,D,1
"""
>>> df = pd.read_csv(StringIO.StringIO(data), parse_dates="Date")
After doing some graphing with the data in this long format, I pivot it to a wide summary table format with columns for each ID.
>>> table = df.pivot_table(index=["Collector", "Date", "Sample", "Type"], columns="ID", values="Value")
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34 22 10 5
2014-06-23 203 HV 33 35 13 1
John 2014-06-22 221 HV 40 39 11 2
2014-07-01 218 HV 35 29 13 1
However, I can't find a concise way to calculate and add some summary rows to the wide format data with mean, median, and maybe some custom aggregation function applied to each of the ID-based columns. This is what I want to end up with:
ID Collector Date Sample Type A B C D
0 Emily 2014-06-20 201 HV 34 22 10 5
2 John 2014-06-22 221 HV 40 39 11 2
1 Emily 2014-06-23 203 HV 33 35 13 1
3 John 2014-07-01 218 HV 35 29 13 1
4 mean 35.5 31.3 11.8 2.3
5 median 34.5 32.0 12.0 1.5
I tried things like calling mean
or median
on the summary table, but I end up with a Series rather than a row I can concatenate to the summary table. The summary rows I want are sort of like pivot_table margins, but the aggregation function is not sum
.
>>> table.mean()
ID
A 35.50
B 31.25
C 11.75
D 2.25
dtype: float64
>>> table.median()
ID
A 34.5
B 32.0
C 12.0
D 1.5
dtype: float64
Upvotes: 0
Views: 1233
Reputation: 880627
You could use aggfunc=[np.mean, np.median]
to compute both the means and the medians. Then you could use margins=True
to also obtain the means and medians for each column and for each row.
result = df.pivot_table(index=["Collector", "Date", "Sample", "Type"],
columns="ID", values="Value", margins=True,
aggfunc=[np.mean, np.median]).stack(level=0)
yields
ID A B C D All
Collector Date Sample Type
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
median 34.0 22.00 10.00 5.00 16.0000
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
median 33.0 35.00 13.00 1.00 23.0000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
median 40.0 39.00 11.00 2.00 25.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
median 35.0 29.00 13.00 1.00 21.0000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Yes, result
contains more data than you asked for, but
result.loc['All']
has the additional values:
ID A B C D All
Date Sample Type
mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Or, you could further subselect result
to get just the rows you are looking for:
result.index.names = [u'Collector', u'Date', u'Sample', u'Type', u'aggfunc']
mask = result.index.get_level_values('aggfunc') == 'mean'
mask[-1] = True
result = result.loc[mask]
print(result)
yields
ID A B C D All
Collector Date Sample Type aggfunc
Emily 2014-06-20 201 HV mean 34.0 22.00 10.00 5.00 17.7500
2014-06-23 203 HV mean 33.0 35.00 13.00 1.00 20.5000
John 2014-06-22 221 HV mean 40.0 39.00 11.00 2.00 23.0000
2014-07-01 218 HV mean 35.0 29.00 13.00 1.00 19.5000
All mean 35.5 31.25 11.75 2.25 20.1875
median 34.5 32.00 12.00 1.50 17.5000
Upvotes: 2
Reputation: 52276
This might not be super clean, but you could assign to the new entries with .loc
.
In [131]: table_mean = table.mean()
In [132]: table_median = table.median()
In [134]: table.loc['Mean', :] = table_mean.values
In [135]: table.loc['Median', :] = table_median.values
In [136]: table
Out[136]:
ID A B C D
Collector Date Sample Type
Emily 2014-06-20 201 HV 34.0 22.00 10.00 5.00
2014-06-23 203 HV 33.0 35.00 13.00 1.00
John 2014-06-22 221 HV 40.0 39.00 11.00 2.00
2014-07-01 218 HV 35.0 29.00 13.00 1.00
Mean 35.5 31.25 11.75 2.25
Median 34.5 32.00 12.00 1.50
Upvotes: 0