Smithey
Smithey

Reputation: 337

Trying to understand Julia syntax in linear regression code (GLM package)

Total Julia noob here (with basic knowledge of Python). I am trying to do linear regression and things I read suggest the GLM package. Here is some sample code I found here:

using DataFrames, GLM
y = 1:10 
df = DataFrame(y = y, x1 = y.^2, x2 = y.^3)
sm = GLM.lm( @formula(y ~ x1 + x2), df )
coef(sm)

Can someone explain the syntax here? What does @formula mean? Docs here say @foo means a macro which I guess is basically just a function, but where do I find the function/macro formula? Just looking at the use here though, I would have thought it is maybe passing y ~ x1 + x2 (whatever that is) as the formula argument to lm? (similar to keyword arguments = in python?)

Next, what is ~ here? General docs say ~ means negation but I'm not seeing how that makes here.

Is there a place in the GLM docs where all of this is explained? I'm not seeing that. Only seeing a few examples but not a full breakdown of each function and all of its arguments.

Upvotes: 4

Views: 486

Answers (1)

Nils Gudat
Nils Gudat

Reputation: 13800

You have stumbled upon the @formula language that is defined in the StatsModels.jl package and implemented in many statistics/econometrics related packages across the Julia ecosystem.

As you say, @formula is a macro, which transforms the expression given to it (here y ~ x1 + x2) into some other Julia expression. If you want to find out what happens when a macro gets called in Julia - which I admit can often look like magic to new (and sometimes experienced!) users - the @macroexpand macro can help you. In this case:

julia> @macroexpand @formula(y ~ x1 + x2)
:(StatsModels.Term(:y) ~ StatsModels.Term(:x1) + StatsModels.Term(:x2))

The result above is the expression constructed by the @formula macro. We see that the variables in our formula macro are transformed into StatsModels.Term objects. If we were to use StatsModels directly, we could construct this ourselves by doing:

julia> Term(:y) ~ Term(:x1) + Term(:x2)
FormulaTerm
Response:
  y(unknown)
Predictors:
  x1(unknown)
  x2(unknown)

julia> (Term(:y) ~ Term(:x1) + Term(:x2)) == @formula(y ~ x1 + x2)
true

Now what is going on with ~, which as you say can be used for negation in Julia? What has happened here is that StatsModels has defined methods for ~ (which in Julia is and infix operator, that means essentially it is a function that can be written in between its arguments rather than having to be called with its arguments in brackets:

julia> (Term(:y) ~ Term(:x)) == ~(Term(:y), Term(:x))
true

So writing y::Term ~ x::Term is the same as calling ~(y::Term, x::Term), and this method for calling ~ with terms on the left and right hand side is defined by StatsModels (see method no. 6 below):

julia> methods(~)
# 6 methods for generic function "~":
[1] ~(x::BigInt) in Base.GMP at gmp.jl:542
[2] ~(::Missing) in Base at missing.jl:100
[3] ~(x::Bool) in Base at bool.jl:39
[4] ~(x::Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8}) in Base at int.jl:254
[5] ~(n::Integer) in Base at int.jl:138
[6] ~(lhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}, rhs::Union{AbstractTerm, Tuple{Vararg{AbstractTerm,N}} where N}) in StatsModels at /home/nils/.julia/packages/StatsModels/pMxlJ/src/terms.jl:397

Note that you also find the general negation meaning here (method 3 above, which defines the behaviour for calling ~ on a boolean argument and is in Base Julia).

I agree that the GLM.jl docs maybe aren't the most comprehensive in the world, but one of the reasons for that is that the whole machinery behind @formula actually isn't a GLM.jl thing - so do check out the StatsModels docs linked above which are quite good I think.

Upvotes: 4

Related Questions