Reputation: 6615

pandas describe by - additional parameters

I see that the pandas library has a describe function, which returns some useful statistics. However, is there a way to add additional rows to the output, such as standard deviation (.std), median absolute deviation (.mad), or the count of unique values?

I can call df.describe(), but I can't find out how to add these additional summary statistics.

Upvotes: 9

Answers (4)

Alexander

Reputation: 225

You might want to use a "custom describe", as suggested in the pandas documentation. It says: "With .agg() it is possible to easily create a custom describe function, similar to the built in describe function". The documentation also gives an example, which I have enriched a bit:

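from functools import partial

import pandas as pd

pets = [{'Dogs': 1, 'Cats': 2}, {'Dogs': 3, 'Cats': 1}, {'Dogs': 2, 'Cats': 5}]
df = pd.DataFrame(pets)

# Wrap Series.quantile so each quartile can be passed to agg() as a callable;
# __name__ sets the row label that appears in the result.
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"
descDf = df.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])
print(descDf)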
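
To get closer to what the question asks for, the same agg() call can take extra entries. Below is a minimal sketch, assuming a recent pandas where the built-in mad aggregation is no longer available. The helper mad is hypothetical (it is not part of pandas) and computes the mean absolute deviation around the mean, which is what the old .mad() method returned.

```python
import pandas as pd

pets = [{'Dogs': 1, 'Cats': 2}, {'Dogs': 3, 'Cats': 1}, {'Dogs': 2, 'Cats': 5}]
df = pd.DataFrame(pets)


def mad(s):
    """Hypothetical helper: mean absolute deviation around the mean."""
    return (s - s.mean()).abs().mean()


# Strings name built-in Series methods; the callable is applied per column
# and its __name__ ("mad") becomes the row label.
extended = df.agg(["count", "nunique", "mean", "std", mad])
print(extended)
```

The row order simply follows the order of the list, so further statistics can be added by appending more names or callables.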

Upvotes: 0

piRSquared

Reputation: 294488

The default describe looks like this:

import numpy as np
import pandas as pd

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(100, 5), columns=list('ABCDE'))

df.describe()

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221

Updated for pandas > 0.21.0 (and before 2.0, which removed DataFrame.append and the mad aggregation)
I'd make my own describe like below. It should be obvious how to add more.

def describe(df, stats):
    d = df.describe()
    return d.append(df.reindex(d.columns, axis = 1).agg(stats))

describe(df, ['skew', 'mad', 'kurt'])

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642
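
On pandas 2.0 and later, DataFrame.append and DataFrame.mad no longer exist, so the function above needs a small adjustment. The following is a sketch under that assumption, not the original author's code: it swaps append for pd.concat and passes a hypothetical mad helper (mean absolute deviation around the mean) in place of the 'mad' string. The random data here uses a different seed from the one above, so the numbers will not match the tables in this answer.

```python
import numpy as np
import pandas as pd


def mad(s):
    """Hypothetical helper: mean absolute deviation around the mean."""
    return (s - s.mean()).abs().mean()


def describe(df, stats):
    d = df.describe()
    # Restrict to the columns describe() kept, then aggregate the extras.
    extra = df[d.columns].agg(stats)
    # pd.concat stacks the extra rows under the describe() output.
    return pd.concat([d, extra])


rng = np.random.default_rng(0)  # assumed seed, for illustration only
df = pd.DataFrame(rng.random((100, 5)), columns=list("ABCDE"))

result = describe(df, ["skew", mad, "kurt"])
print(result)
```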

Updated for pandas 0.20
I'd make my own describe like below. It should be obvious how to add more.

def describe(df, stats):
    d = df.describe()
    return d.append(df.reindex_axis(d.columns, 1).agg(stats))

describe(df, ['skew', 'mad', 'kurt'])

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
mad      0.267730    0.249968    0.254351    0.228558    0.242874
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642

Old Answer

def describe(df):
    return pd.concat([df.describe().T,
                      df.mad().rename('mad'),
                      df.skew().rename('skew'),
                      df.kurt().rename('kurt'),
                     ], axis=1).T

describe(df)

                A           B           C           D           E
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.495871    0.472939    0.455570    0.503899    0.451341
std      0.303589    0.291968    0.294984    0.269936    0.284666
min      0.006453    0.001559    0.001068    0.015311    0.009526
25%      0.239379    0.219141    0.196251    0.294371    0.202956
50%      0.529596    0.456548    0.376558    0.532002    0.432936
75%      0.759452    0.739666    0.665563    0.730702    0.686793
max      0.999799    0.994510    0.997271    0.981551    0.979221
mad      0.267730    0.249968    0.254351    0.228558    0.242874
skew    -0.014942    0.048054    0.247244   -0.125151    0.066156
kurt    -1.323469   -1.223123   -1.095713   -1.083420   -1.148642

Upvotes: 17

bzip2

Reputation: 101

The answer from piRSquared makes the most sense to me, but I get a deprecation warning about reindex_axis on Python 3.5 (the warning comes from pandas, which deprecated reindex_axis in favor of reindex). This works for me:

    stats = data.describe()
    stats.loc['IQR'] = stats.loc['75%'] - stats.loc['25%'] # appending interquartile range instead of recalculating it
    stats = stats.append(data.reindex(stats.columns, axis=1).agg(['skew', 'mad', 'kurt']))
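
The row-assignment trick used for IQR above also works for other statistics, which avoids append altogether. Here is a minimal sketch with a made-up data frame (the values are assumptions for illustration only). It adds the unique-value count and a median absolute deviation computed by hand, since the old .mad() method was a mean, not a median, absolute deviation.

```python
import pandas as pd

# Made-up example data, not from the original answer.
data = pd.DataFrame({"num1": [1, 2, 3, 10], "num2": [4, 5, 6, 6]})

stats = data.describe()
# Derive the interquartile range from rows describe() already computed.
stats.loc["IQR"] = stats.loc["75%"] - stats.loc["25%"]
# Each assignment adds one row; the Series is aligned on column names.
stats.loc["nunique"] = data.nunique()
stats.loc["median_abs_dev"] = (data - data.median()).abs().median()
print(stats)
```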

Upvotes: 5

Merlin

Reputation: 25659

Try this:

df.describe()

      num1  num2
count   3.0   3.0
mean    2.0   5.0
std     1.0   1.0
min     1.0   4.0
25%     1.5   4.5
50%     2.0   5.0
75%     2.5   5.5
max     3.0   6.0

Build a second DataFrame.

pd.DataFrame(df.mad(), columns=["Mad"]).T

         num1      num2
Mad  0.666667  0.666667

Join the two DataFrames.

pd.concat([df.describe(), pd.DataFrame(df.mad(), columns=["Mad"]).T])

          num1      num2
count  3.000000  3.000000
mean   2.000000  5.000000
std    1.000000  1.000000
min    1.000000  4.000000
25%    1.500000  4.500000
50%    2.000000  5.000000
75%    2.500000  5.500000
max    3.000000  6.000000
Mad    0.666667  0.666667
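
For pandas versions where df.mad() has been removed (2.0 and later), the same join can be sketched with the deviation computed by hand. The input frame is not shown in this answer, so the values below are an assumption inferred from the printed statistics (mean 2 and 5, min 1 and 4, max 3 and 6).

```python
import pandas as pd

# Assumed input, reconstructed from the describe() output above.
df = pd.DataFrame({"num1": [1, 2, 3], "num2": [4, 5, 6]})

# Mean absolute deviation per column, turned into a one-row DataFrame.
mad_row = (df - df.mean()).abs().mean().to_frame("Mad").T

# Stack the extra row under the describe() output.
combined = pd.concat([df.describe(), mad_row])
print(combined)
```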

Upvotes: 2
