Reputation: 1448
Is there something like R's table function in Julia? I've read about xtab, but do not know how to use it.
Suppose we have an R data.frame rdata whose column col6 is of the Factor type.
R sample code:
rdata <- read.csv("mycsv.csv") #1
table(rdata$col6) #2
In order to read data and make factors in Julia I do it like this:
using DataFrames
jldata = readtable("mycsv.csv", makefactors=true) #1 :col6 will be now pooled.
But how do I build something like R's table in Julia (that is, how do I achieve #2)?
Upvotes: 8
Views: 3389
Reputation: 11128
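I believe by is no longer available as of Julia 1.5.3; calling it fails with "ERROR: ArgumentError: by function was removed from DataFrames.jl". Note that, per the error message, the removal is in the DataFrames.jl package rather than in Julia itself.
So here are some alternatives: we can use split-apply-combine to build the cross tabs, or use FreqTables.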
Using Split Combine:
Example 1 - SingleColumn:
using RDatasets
using DataFrames
mtcars = dataset("datasets", "mtcars")
## To do a table on cyl column
gdf = groupby(mtcars, :Cyl)
combine(gdf, nrow)
Output:
# 3×2 DataFrame
# Row │ Cyl nrow
# │ Int64 Int64
# ─────┼──────────────
# 1 │ 6 7
# 2 │ 4 11
# 3 │ 8 14
Example 2 - CrossTabs Between 2 columns:
## only the groupby call changes a little; the rest is the same
gdf = groupby(mtcars, [:Cyl, :AM])
combine(gdf, nrow)
Output:
#6×3 DataFrame
# Row │ Cyl AM nrow
# │ Int64 Int64 Int64
#─────┼─────────────────────
# 1 │ 6 1 3
# 2 │ 4 1 8
# 3 │ 6 0 4
# 4 │ 8 0 12
# 5 │ 4 0 3
# 6 │ 8 1 2
As a side note, if you don't like nrow as the column name, you can rename it to Count:
combine(gdf, nrow => :Count)
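The grouped counts above come back in order of first appearance and in long format, whereas R's table sorts the levels and lays two variables out as a grid. A minimal sketch of getting closer to that shape, assuming a recent DataFrames.jl; the small data frame below is made up for illustration and stands in for mtcars:

```julia
using DataFrames

# Made-up stand-in for mtcars (illustrative values only, not the real data set)
df = DataFrame(Cyl = [6, 4, 8, 4, 6, 8, 4], AM = [1, 1, 0, 0, 0, 0, 1])

# Long-format counts, sorted by the grouping keys
counts = sort(combine(groupby(df, [:Cyl, :AM]), nrow => :Count), [:Cyl, :AM])

# Wide format: one row per Cyl value, one column per AM value.
# Combinations that never occur (here Cyl = 8, AM = 1) come out as missing.
wide = unstack(counts, :Cyl, :AM, :Count)
```

If zeros are preferred over missing for absent combinations, something like coalesce.(wide, 0) should do it.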
Alternate way: Using FreqTables
You can use the FreqTables package, as below, to compute counts and proportions very easily. To install it, run Pkg.add("FreqTables"):
using FreqTables
## Cross tab between cyl and am
freqtable(mtcars.Cyl, mtcars.AM)
## Proportion between cyl and am (over the whole table)
prop(freqtable(mtcars.Cyl, mtcars.AM))
## As in R, you can pass a margin: margins=2 gives column-wise proportions
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=2)
## margins=1 gives row-wise proportions
prop(freqtable(mtcars.Cyl, mtcars.AM), margins=1)
Outputs:
## count cross tabs
#3×2 Named Array{Int64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────
#4 │ 3 8
#6 │ 4 3
#8 │ 12 2
## proportion wise (overall)
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼─────────────────
#4 │ 0.09375 0.25
#6 │ 0.125 0.09375
#8 │ 0.375 0.0625
## Column wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.157895 0.615385
#6 │ 0.210526 0.230769
#8 │ 0.631579 0.153846
## Row wise proportion
#3×2 Named Array{Float64,2}
#Dim1 ╲ Dim2 │ 0 1
#────────────┼───────────────────
#4 │ 0.272727 0.727273
#6 │ 0.571429 0.428571
#8 │ 0.857143 0.142857
Upvotes: 6
Reputation: 1448
I came to the conclusion that a similar effect can be achieved using by.
Suppose jldata contains a :gender column:
julia> by(jldata, :gender, nrow)
3x2 DataFrames.DataFrame
| Row | gender | x1 |
|-----|----------|-------|
| 1 | NA | 175 |
| 2 | "female" | 40254 |
| 3 | "male" | 58574 |
Of course it's not a table, but at least I get the same data type as the data source. Surprisingly, by seems to be faster than countmap.
Upvotes: 7
Reputation: 1380
You can use the countmap function from StatsBase.jl to count the entries of a single variable. General cross tabulation and statistical tests for contingency tables are lacking at this point. As Ismael points out, this has been discussed in the issue tracker for StatsBase.jl.
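For the single-variable case, countmap(col) returns a dictionary from each value to its count. For two variables, the same idea can be hand-rolled in plain Julia. The sketch below uses only Base; crosscount is a hypothetical helper name, not something StatsBase.jl provides, and the vectors are made up for illustration:

```julia
# Hypothetical helper (not part of StatsBase.jl): count how often each
# (a[i], b[i]) pair occurs, returning a Dict keyed by the pair.
function crosscount(a::AbstractVector, b::AbstractVector)
    length(a) == length(b) || throw(ArgumentError("vectors must have equal length"))
    counts = Dict{Tuple{eltype(a),eltype(b)},Int}()
    for pair in zip(a, b)
        counts[pair] = get(counts, pair, 0) + 1
    end
    return counts
end

# Made-up example data
cyl = [6, 6, 4, 6, 8, 4]
am  = [1, 1, 1, 0, 0, 0]

tab = crosscount(cyl, am)
```

Pairs that never occur are simply absent from the dictionary, so get(tab, (8, 1), 0) is a safe way to read a cell.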
Upvotes: 8