Reputation: 6615
I see that the pandas library has a describe function, which returns some useful statistics. However, is there a way to add additional rows to the output, such as standard deviation (.std), median absolute deviation (.mad), or the count of unique values?
I can call df.describe(), but I can't find out how to add these additional summary statistics.
Upvotes: 9
Views: 15233
Reputation: 225
You might want to use a "custom describe", as suggested in the pandas documentation. It says: "With .agg() it is possible to easily create a custom describe function, similar to the built in describe function". The documentation also gives an example, which I have enriched a bit:
from functools import partial

import pandas as pd

pets = [{'Dogs': 1, 'Cats': 2}, {'Dogs': 3, 'Cats': 1}, {'Dogs': 2, 'Cats': 5}]
df = pd.DataFrame(pets)

# Wrap Series.quantile so that each quartile gets a readable row label
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"

descDf = df.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])
print(descDf)
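The question also asks for the count of unique values, which the list above does not include. As a minimal sketch (not part of the documentation example), "nunique" can be passed to .agg() as one more string. The fourth row of pets is made up for illustration so that a value repeats, and the quantile helpers are left out for brevity:

```python
import pandas as pd

# Same pets data as above, plus one hypothetical row so that a value repeats
pets = [
    {'Dogs': 1, 'Cats': 2},
    {'Dogs': 3, 'Cats': 1},
    {'Dogs': 2, 'Cats': 5},
    {'Dogs': 3, 'Cats': 2},  # added for illustration only
]
df = pd.DataFrame(pets)

# "nunique" resolves to Series.nunique, just like "mean" resolves to Series.mean
desc = df.agg(["count", "mean", "std", "min", "median", "max", "nunique"])
print(desc)
```

Each column here has four rows but only three distinct values, so the nunique row shows 3 while the count row shows 4.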
Upvotes: 0
Reputation: 294488
The default describe looks like this:
import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 5), columns=list('ABCDE'))
df.describe()
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
Updated for pandas > 0.21.0
I'd make my own describe like below. It should be obvious how to add more.
# Note: DataFrame.append and the 'mad' aggregation were removed in pandas 2.0,
# so this version needs an older pandas.
def describe(df, stats):
    d = df.describe()
    return d.append(df.reindex(d.columns, axis=1).agg(stats))

describe(df, ['skew', 'mad', 'kurt'])
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
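On pandas 2.0 and later, DataFrame.append and the built-in mad aggregation no longer exist, so the function above fails there. A possible adaptation, as a sketch only: stack the rows with pd.concat and pass a small callable in place of 'mad'. The helper name mean_abs_dev is my own and not a pandas function; it computes the mean absolute deviation around the mean, which is what mad used to return.

```python
import numpy as np
import pandas as pd


def mean_abs_dev(s):
    """Mean absolute deviation around the mean (hypothetical stand-in for 'mad')."""
    return (s - s.mean()).abs().mean()


def describe(df, stats):
    """Return df.describe() with one extra row per entry in stats."""
    d = df.describe()
    # Restrict to the columns describe() kept, then aggregate them
    extra = df[d.columns].agg(stats)
    # pd.concat stacks the extra rows underneath the default ones
    return pd.concat([d, extra])


np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(100, 5), columns=list('ABCDE'))

# Strings name built-in reductions; callables are labelled by their __name__
result = describe(df, ['skew', 'kurt', mean_abs_dev])
print(result)
```

Passing a callable keeps the same stats list interface, so any other per-column reduction can be added the same way.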
Updated for pandas 0.20
I'd make my own describe like below. It should be obvious how to add more.
# Note: reindex_axis was deprecated in pandas 0.21 in favor of reindex.
def describe(df, stats):
    d = df.describe()
    return d.append(df.reindex_axis(d.columns, 1).agg(stats))

describe(df, ['skew', 'mad', 'kurt'])
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
Old Answer
def describe(df):
    return pd.concat([df.describe().T,
                      df.mad().rename('mad'),
                      df.skew().rename('skew'),
                      df.kurt().rename('kurt'),
                     ], axis=1).T

describe(df)
                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
mad      0.267730    0.249968    0.254351    0.228558    0.242874
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
Upvotes: 17
Reputation: 101
The answer from piRSquared makes the most sense to me, but I get a deprecation warning about reindex_axis (I'm on Python 3.5). This works for me:
stats = data.describe()
# Append the interquartile range, derived from the quartiles describe() already computed
stats.loc['IQR'] = stats.loc['75%'] - stats.loc['25%']
stats = stats.append(data.reindex(stats.columns, axis=1).agg(['skew', 'mad', 'kurt']))
Upvotes: 5
Reputation: 25659
Try this:
df.describe()
       num1  num2
count   3.0   3.0
mean    2.0   5.0
std     1.0   1.0
min     1.0   4.0
25%     1.5   4.5
50%     2.0   5.0
75%     2.5   5.5
max     3.0   6.0
Build a second DataFrame.
pd.DataFrame(df.mad(), columns=["Mad"]).T
         num1      num2
Mad  0.666667  0.666667
Join the two DataFrames.
pd.concat([df.describe(), pd.DataFrame(df.mad(), columns=["Mad"]).T])
           num1      num2
count  3.000000  3.000000
mean   2.000000  5.000000
std    1.000000  1.000000
min    1.000000  4.000000
25%    1.500000  4.500000
50%    2.000000  5.000000
75%    2.500000  5.500000
max    3.000000  6.000000
Mad    0.666667  0.666667
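Note that pandas' mad was the mean absolute deviation, not the median absolute deviation the question mentions, and the method was removed in pandas 2.0. Here is a sketch that computes both by hand and adds them as rows. The DataFrame is reconstructed on the assumption that the columns hold 1, 2, 3 and 4, 5, 6, which matches the summary shown above, and the row labels are my own:

```python
import pandas as pd

# Assumed data: consistent with the count/mean/min/max rows shown above
df = pd.DataFrame({"num1": [1, 2, 3], "num2": [4, 5, 6]})

stats = df.describe()

# Mean absolute deviation around the mean (what df.mad() used to return)
stats.loc["mean_abs_dev"] = (df - df.mean()).abs().mean()

# Median absolute deviation around the median
stats.loc["median_abs_dev"] = (df - df.median()).abs().median()

print(stats)
```

With this data, mean_abs_dev comes out as 0.666667 for both columns (the same as the Mad row above), while median_abs_dev is 1.0.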
Upvotes: 2