kathystehl

Reputation: 841

SparkR MLlib & spark.ml: least squares and glm optimization

Would anyone be able to explain how to specify optimization methods in SparkR's glm function? When I try to fit an OLS model with glm, I can only specify "normal" or "auto" as the solver type. SparkR isn't able to interpret the solver specification "l-bfgs", which leads me to believe that when I specify "auto", SparkR simply falls back to "normal" and estimates the model coefficients analytically via the least-squares normal equation.

Is fitting GLMs with stochastic gradient descent or L-BFGS not available in SparkR, or am I writing the following call incorrectly?

m <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")

There's plenty of documentation in Spark about using iterative methods to fit GLMs, e.g. LogisticRegressionWithLBFGS and LinearRegressionWithSGD (discussed here), but I haven't been able to find any such documentation for the R API. Is this simply not available in SparkR (i.e. are SparkR users constrained to the analytical solver and, therefore, to data sizes the normal equation can handle), or am I missing something essential here? If it isn't currently available in SparkR, is it expected to arrive in SparkR 2.0.0?

Below, I create a toy data set and fit three models, each with a different solver specification:
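(The sqlContext used below comes from a standard Spark 1.6 SparkR initialization; a minimal sketch of that setup, assuming a local session and an illustrative app name:)

library(SparkR)
sc <- sparkR.init(master = "local[*]", appName = "glm-solver-test")
sqlContext <- sparkRSQL.init(sc)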

x1 <- rnorm(n=200, mean=10, sd=2)
x2 <- rnorm(n=200, mean=17, sd=3)
x3 <- rnorm(n=200, mean=8, sd=1)
y <- 1 + .2 * x1 + .4 * x2 + .5 * x3 + rnorm(n=200, mean=0, sd=.1)
dat <- cbind.data.frame(y, x1, x2, x3)

df <- as.DataFrame(sqlContext, dat)

m1 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "normal")
m2 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "auto")
m3 <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
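To compare the fits, I look at the estimated coefficients (a quick sketch; I'm assuming summary() on a SparkR glm model exposes a $coefficients matrix, as described in the 1.6 docs):

summary(m1)$coefficients  # solver = "normal"
summary(m2)$coefficients  # solver = "auto"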

The first and second models produce the same parameter estimates (supporting my assumption that SparkR solves the normal equation for both fits and that the two models are therefore equivalent). SparkR is able to fit the third model, but when I try to print its summary, the following call fails:

summary(m3)  # error

For reference, I am running this on AWS and have tried different EMR releases, including the most recent, in case that makes a difference. I am using Spark 1.6.1 (the R API).

Upvotes: 2

Views: 632

Answers (1)

SpiritusPrana

Reputation: 480

The Spark 1.6.2 API documentation is here; it describes the solver argument as follows:

solver:
The solver algorithm used for optimization, this can be "l-bfgs", "normal" and "auto". "l-bfgs" denotes Limited-memory BFGS which is a limited-memory quasi-Newton optimization method. "normal" denotes using Normal Equation as an analytical solution to the linear regression problem. The default value is "auto" which means that the solver algorithm is selected automatically.

To me, this looks worthy of a bug report on the Apache Spark Jira site.
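If you do raise one, a minimal reproduction along these lines should make it easy to triage (a sketch that reuses the toy data and sqlContext naming from the question):

df <- as.DataFrame(sqlContext, dat)
m <- SparkR::glm(y ~ x1 + x2 + x3, data = df, solver = "l-bfgs")
summary(m)  # fails here, even though "l-bfgs" is listed as a valid solver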

Upvotes: 1
