Reputation: 337
Total Julia noob here (with basic knowledge of Python). I am trying to do linear regression and things I read suggest the GLM package. Here is some sample code I found here:
using DataFrames, GLM
y = 1:10
df = DataFrame(y = y, x1 = y.^2, x2 = y.^3)
sm = GLM.lm( @formula(y ~ x1 + x2), df )
coef(sm)
Can someone explain the syntax here? What does @formula mean? Docs here say @foo means a macro, which I guess is basically just a function, but where do I find the function/macro formula? Just looking at the use here though, I would have thought it is maybe passing y ~ x1 + x2 (whatever that is) as the formula argument to lm? (Similar to keyword arguments with = in Python?)
Next, what is ~ here? General docs say ~ means negation, but I'm not seeing how that makes sense here.
Is there a place in the GLM docs where all of this is explained? I'm not seeing that. Only seeing a few examples but not a full breakdown of each function and all of its arguments.
Upvotes: 4
Views: 486
Reputation: 13800
You have stumbled upon the @formula language that is defined in the StatsModels.jl package and implemented in many statistics/econometrics related packages across the Julia ecosystem.
As you say, @formula is a macro, which transforms the expression given to it (here y ~ x1 + x2) into some other Julia expression. If you want to find out what happens when a macro gets called in Julia - which I admit can often look like magic to new (and sometimes experienced!) users - the @macroexpand macro can help you. In this case:
julia> @macroexpand @formula(y ~ x1 + x2)
:(StatsModels.Term(:y) ~ StatsModels.Term(:x1) + StatsModels.Term(:x2))
The result above is the expression constructed by the @formula macro. We see that the variables in our formula are transformed into StatsModels.Term objects. If we were to use StatsModels directly, we could construct this ourselves by doing:
julia> Term(:y) ~ Term(:x1) + Term(:x2)
FormulaTerm
Response:
y(unknown)
Predictors:
x1(unknown)
x2(unknown)
julia> (Term(:y) ~ Term(:x1) + Term(:x2)) == @formula(y ~ x1 + x2)
true
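Since a formula is just an ordinary object built from Term values, it can also be assembled at run time, for example from a list of column names. Here is a minimal sketch of that idea, reusing the data frame from the question. The variable names (predictors, f_built, m_macro, m_built) are made up for illustration, and I am assuming that lm accepts a hand-built formula the same way it accepts one produced by @formula:

```julia
using DataFrames, GLM, StatsModels

y = 1:10
df = DataFrame(y = y, x1 = y.^2, x2 = y.^3)

# Formula written with the macro, as in the question.
f_macro = @formula(y ~ x1 + x2)

# The same formula assembled from column names held in a variable.
# `predictors` is a hypothetical name chosen for this sketch.
predictors = [:x1, :x2]
f_built = Term(:y) ~ sum(Term.(predictors))

# Both formulas are passed to lm in exactly the same way.
m_macro = lm(f_macro, df)
m_built = lm(f_built, df)

coef(m_built)
```

This is handy when the set of regressors is only known while the program is running, which is something the macro form cannot express directly.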
Now what is going on with ~, which as you say can be used for negation in Julia? What has happened here is that StatsModels has defined methods for ~, which in Julia is an infix operator. That essentially means it is a function that can be written in between its arguments rather than having to be called with its arguments in brackets:
julia> (Term(:y) ~ Term(:x)) == ~(Term(:y), Term(:x))
true
So writing y::Term ~ x::Term is the same as calling ~(y::Term, x::Term), and this method for calling ~ with terms on the left and right hand side is defined by StatsModels (see method no. 6 below):
julia> methods(~)
# 6 methods for generic function "~":
[1] ~(x::BigInt) in Base.GMP at gmp.jl:542
[2] ~(::Missing) in Base at missing.jl:100
[3] ~(x::Bool) in Base at bool.jl:39
[4] ~(x::Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8}) in Base at int.jl:254
[5] ~(n::Integer) in Base at int.jl:138
[6] ~(lhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}, rhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}) in StatsModels at /home/nils/.julia/packages/StatsModels/pMxlJ/src/terms.jl:397
Note that you also find the general negation meaning here (method 3 above, which defines the behaviour for calling ~ on a boolean argument and is in Base Julia).
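To make the mechanism concrete, here is a toy sketch of how any package can give ~ a two-argument meaning for its own types without touching the built-in negation. The types Var and Relation are invented for this example and are not what StatsModels uses; the point is only that adding a method to Base's ~ is ordinary multiple dispatch:

```julia
# Hypothetical types for illustration only (not part of StatsModels).
struct Var
    name::Symbol
end

struct Relation
    lhs::Var
    rhs::Var
end

# Add a two-argument method to the existing function `~` from Base.
Base.:~(a::Var, b::Var) = Relation(a, b)

# Infix and prefix call syntax reach the same method.
r1 = Var(:y) ~ Var(:x)
r2 = ~(Var(:y), Var(:x))

# The one-argument methods from Base are unaffected.
negated = ~true
```

Which method runs depends only on the number and types of the arguments, so the new meaning and the negation meaning coexist without conflict.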
I agree that the GLM.jl docs maybe aren't the most comprehensive in the world, but one of the reasons for that is that the whole machinery behind @formula actually isn't a GLM.jl thing, so do check out the StatsModels docs linked above, which are quite good I think.
Upvotes: 4