Julia - describe() function displays incomplete summary statistics

Question

I'm trying basic data analysis with Julia.

I'm following this tutorial with the train dataset that can be found here (the one named train_u6lujuX_CVtuZ9i.csv), using the following code:

using DataFrames, RDatasets, CSV, StatsBase
train = CSV.read("/Path/to/train_u6lujuX_CVtuZ9i.csv");
describe(train[:LoanAmount])

and get this output:

Summary Stats:
Length:         614
Type:           Union{Missing, Int64}
Number Unique:  204

instead of the output of the tutorial:

Summary Stats:
Mean:           146.412162
Minimum:        9.000000
1st Quartile:   100.000000
Median:         128.000000
3rd Quartile:   168.000000
Maximum:        700.000000
Length:         592
Type:           Int64
% Missing:      3.583062

The tutorial's output also corresponds to the output that, according to StatsBase.jl, the describe() function should give.

Bogumił Kamiński · Accepted Answer

This is how describe is implemented in the current release of StatsBase.jl. In short, train.LoanAmount does not have an eltype that is a subtype of Real (it is Union{Missing, Int64}), so StatsBase.jl uses a fallback method that prints only the length, the eltype, and the number of unique values. You can write describe(collect(skipmissing(train.LoanAmount))) to get the summary statistics (except the number of missing values, of course).
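As a minimal sketch of the workaround above, the snippet below uses a small made-up vector (the name loan and its values are hypothetical stand-ins for train.LoanAmount, not data from the real file). It shows why the eltype check fails and how dropping the missing entries restores the numeric summary. The fallback behaviour described in the comments is the one from the StatsBase.jl release discussed here; other releases may behave differently.

```julia
using StatsBase

# Hypothetical stand-in for train.LoanAmount: integers with one missing entry
loan = Union{Missing, Int}[100, 128, missing, 168, 700]

# Union{Missing, Int64} is not a subtype of Real, which is why the
# StatsBase.jl release discussed here falls back to the short summary
eltype_is_real = eltype(loan) <: Real

# Drop the missing entries; the result is a plain Vector{Int64}
present = collect(skipmissing(loan))

# Prints the numeric summary statistics (mean, quartiles, extrema, ...)
describe(present)

# The number of missing values has to be counted separately
n_missing = count(ismissing, loan)
```

Since skipmissing discards the missing entries before describe sees them, the missing count is taken from the original vector rather than from the summary.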

However, I would recommend a different approach. If you want more verbose output for a single column, use:

describe(train, :all, cols=:LoanAmount)

This returns the output as a DataFrame, so you can not only see the statistics but also access them.

The :all option prints all statistics; please refer to the describe docstring in DataFrames.jl to see the available options.
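To illustrate accessing the statistics, here is a small sketch that assumes a reasonably recent DataFrames.jl release and uses a made-up data frame (df and its values are hypothetical, not the tutorial data). It runs the same describe call and then reads individual values from the returned DataFrame; the column names mean and nmissing are the ones I would expect in the result, so check the docstring if your version differs.

```julia
using DataFrames

# Hypothetical data standing in for the train data frame
df = DataFrame(LoanAmount = [100, 128, missing, 168, 700])

# One row per described column, one column per statistic
stats = describe(df, :all, cols=:LoanAmount)

# Individual statistics can be read like any other DataFrame column;
# missing values are skipped when the statistics are computed
loan_mean = stats.mean[1]
loan_nmissing = stats.nmissing[1]

# Specific statistics can also be requested by name instead of :all
short = describe(df, :mean, :median, :nmissing, cols=:LoanAmount)
```

Requesting only the statistics you need keeps the result narrow, which is convenient when describing many columns at once.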

You can find some examples of using this function on a current release of DataFrames.jl here.
