Reputation: 1967
I have a dataframe of the following form (for example)
shopper_num,is_martian,number_of_items,count_pineapples,birth_country,tranpsortation_method
1,FALSE,0,0,MX,
2,FALSE,1,0,MX,
3,FALSE,0,0,MX,
4,FALSE,22,0,MX,
5,FALSE,0,0,MX,
6,FALSE,0,0,MX,
7,FALSE,5,0,MX,
8,FALSE,0,0,MX,
9,FALSE,4,0,MX,
10,FALSE,2,0,MX,
11,FALSE,0,0,MX,
12,FALSE,13,0,MX,
13,FALSE,0,0,CA,
14,FALSE,0,0,US,
How can I use Pandas to calculate summary statistics of each column (column data types are variable, some columns have no information
And then return the a dataframe of the form:
columnname, max, min, median,
is_martian, NA, NA, FALSE
So on and so on
Upvotes: 60
Views: 153210
Reputation: 5706
Now there is the pandas_profiling
package, which is a more complete alternative to df.describe()
.
If your pandas dataframe is df
, the below will return a complete analysis including some warnings about missing values, skewness, etc. It presents histograms and correlation plots as well.
import pandas_profiling
pandas_profiling.ProfileReport(df)
See the example notebook detailing the usage.
Upvotes: 29
Reputation: 2421
To clarify one point in @EdChum's answer, per the documentation, you can include the object columns by using df.describe(include='all')
. It won't provide many statistics, but will provide a few pieces of info, including count, number of unique values, top value. This may be a new feature, I don't know as I am a relatively new user.
Upvotes: 6
Reputation: 394319
describe
may give you everything you want otherwise you can perform aggregations using groupby and pass a list of agg functions: http://pandas.pydata.org/pandas-docs/stable/groupby.html#applying-multiple-functions-at-once
In [43]:
df.describe()
Out[43]:
shopper_num is_martian number_of_items count_pineapples
count 14.0000 14 14.000000 14
mean 7.5000 0 3.357143 0
std 4.1833 0 6.452276 0
min 1.0000 False 0.000000 0
25% 4.2500 0 0.000000 0
50% 7.5000 0 0.000000 0
75% 10.7500 0 3.500000 0
max 14.0000 False 22.000000 0
[8 rows x 4 columns]
Note that some columns cannot be summarised as there is no logical way to summarise them, for instance columns containing string data
As you prefer you can transpose the result if you prefer:
In [47]:
df.describe().transpose()
Out[47]:
count mean std min 25% 50% 75% max
shopper_num 14 7.5 4.1833 1 4.25 7.5 10.75 14
is_martian 14 0 0 False 0 0 0 False
number_of_items 14 3.357143 6.452276 0 0 0 3.5 22
count_pineapples 14 0 0 0 0 0 0 0
[4 rows x 8 columns]
Upvotes: 107