ecjb

Reputation: 5449

Julia - describe() function displays incomplete summary statistics

I'm trying some basic data analysis with Julia.

I'm following this tutorial with the train dataset that can be found here (the one named train_u6lujuX_CVtuZ9i.csv), using the following code:

using DataFrames, RDatasets, CSV, StatsBase
train = CSV.read("/Path/to/train_u6lujuX_CVtuZ9i.csv");
describe(train[:LoanAmount])

and get this output:

Summary Stats:
Length:         614
Type:           Union{Missing, Int64}
Number Unique:  204

instead of the output of the tutorial:

Summary Stats:
Mean:           146.412162
Minimum:        9.000000
1st Quartile:   100.000000
Median:         128.000000
3rd Quartile:   168.000000
Maximum:        700.000000
Length:         592
Type:           Int64
% Missing:      3.583062

This also corresponds to the output that the describe() function should give according to StatsBase.jl.

Upvotes: 2

Views: 4660

Answers (1)

Bogumił Kamiński

Reputation: 69819

This is how describe is implemented in the current release of StatsBase.jl. In short, the eltype of train.LoanAmount is Union{Missing, Int64}, which is not a subtype of Real, so StatsBase.jl uses a fallback method that prints only the length, the eltype, and the number of unique values. You can write describe(collect(skipmissing(train.LoanAmount))) to get the summary statistics (except the number of missing values, of course).
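The skipmissing workaround can be sketched as follows. This is only an illustration: loan_amount is a small made-up vector standing in for train.LoanAmount, not the tutorial data, and the missing count is computed separately because it is lost once the missing values are dropped.

```julia
using StatsBase

# Hypothetical stand-in for train.LoanAmount: integers with some missing values.
loan_amount = Union{Missing, Int}[100, missing, 128, 168, missing, 700, 9]

# Drop the missing entries; the result is a plain Vector{Int}.
observed = collect(skipmissing(loan_amount))

# The element type is now a subtype of Real, so describe prints the numeric summary.
describe(observed)

# The number of missing values is not part of that summary, so count it directly.
n_missing = count(ismissing, loan_amount)
pct_missing = 100 * n_missing / length(loan_amount)
println("Missing: ", n_missing, " (", pct_missing, "%)")
```

Here collect is what turns the lazy skipmissing iterator into a vector whose eltype no longer includes Missing.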

However, I would recommend another approach. If you want more verbose output for a single column, use:

describe(train, :all, cols=:LoanAmount)

The output is also returned as a DataFrame, so you can not only see the statistics but also access them.

The :all option prints all statistics; refer to the describe docstring in DataFrames.jl to see the available options.
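As a rough sketch of accessing the statistics from the returned DataFrame, assuming a recent DataFrames.jl release: the data frame below is made up for illustration (it is not the tutorial data), and the particular statistics requested are just one possible selection.

```julia
using DataFrames

# Hypothetical one-column data frame standing in for train.
df = DataFrame(LoanAmount = Union{Missing, Int}[100, missing, 128, 168, missing, 700, 9])

# Request a few specific statistics for one column; the result is itself a DataFrame.
stats = describe(df, :mean, :median, :nmissing, cols=:LoanAmount)

# Each statistic is a column of the result, with one row per described variable.
loan_mean = stats.mean[1]
loan_median = stats.median[1]
loan_nmissing = stats.nmissing[1]
println(stats)
```

Unlike the StatsBase.jl fallback, this reports the numeric statistics over the non-missing values and the number of missing values in the same table.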

You can find some examples of using this function on a current release of DataFrames.jl here.

Upvotes: 6
