Greg

Reputation: 39

How to define a features column in Spark ML

I am trying to run the Spark logistic regression function (ml, not mllib). I have a DataFrame which looks like this (only the first row shown):

+-----+--------+
|label|features|
+-----+--------+
|  0.0|  [60.0]|

(Right now I'm just trying to keep it simple with only one dimension in the feature vector, but I will expand later on.)

I run the following code, taken from the Spark ML documentation:

import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(df)

This gives me the error:

org.apache.spark.SparkException: Values to assemble cannot be null.

I'm not sure how to fix this error. I looked at sample_libsvm_data.txt, which is in the Spark GitHub repo and is used in some of the examples in the Spark ML documentation. That DataFrame looks like this:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|

Based on this example, my data looks like it is in the right format, with one issue. Is 692 the number of features? That seems rather dumb if so; Spark should just be able to look at the length of the feature vector to see how many features there are. If I do need to add the number of features, how would I do that? (I'm pretty new to Scala/Java.)

Cheers

Upvotes: 1

Views: 2104

Answers (1)

Manish Mishra

Reputation: 848

  1. This error is thrown by VectorAssembler when any of the features is null. Please verify that your rows don't contain null values. If there are null values, you must convert them to a default numeric value before running the VectorAssembler (see the first sketch after this list).

  2. Regarding the format of sample_libsvm_data.txt: it is stored in LIBSVM (sparse) form, where each line looks like 0 128:51 129:159 130:253. The leading 0 is the label, and each subsequent entry is in index:value form. The 692 in your printout is the size of the sparse vector, i.e. the total number of features; it must be given explicitly because a sparse vector only lists its non-zero entries (see the second sketch after this list).
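
A minimal sketch of the null-handling in point 1, assuming a hypothetical raw DataFrame with a nullable numeric column "age" (the column name and the default fill value are illustrative):

import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical input: the second row has a null feature value
val raw = sqlContext.createDataFrame(Seq(
  (0.0, Some(60.0)),
  (1.0, None: Option[Double])
)).toDF("label", "age")

// Replace nulls with a default before assembling; otherwise
// VectorAssembler throws "Values to assemble cannot be null."
val cleaned = raw.na.fill(0.0, Seq("age"))

val assembler = new VectorAssembler()
  .setInputCols(Array("age"))
  .setOutputCol("features")

val assembled = assembler.transform(cleaned)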
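
And a second sketch for point 2, showing why the 692 must be given explicitly: a sparse vector is built as (size, indices, values), so Spark cannot infer the total number of features from the non-zero entries alone (the index/value numbers below are illustrative):

import org.apache.spark.mllib.linalg.Vectors

// Sparse form, like the (692,[127,128,...],[...]) rows in sample_libsvm_data.txt:
// 692 is the vector size, i.e. the total number of features
val sparseVec = Vectors.sparse(692, Array(127, 128, 129), Array(51.0, 159.0, 253.0))

// A dense vector stores every entry, so its size is just its length
val denseVec = Vectors.dense(60.0)

println(sparseVec.size) // 692
println(denseVec.size)  // 1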

You can build your single-feature DataFrame with the Vectors class as follows (I ran this on a 1.6.1 shell):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.classification.LogisticRegression

// Each row is (label, feature vector); Vectors.dense builds the features column
val training1 = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(3.0)),
  (0.0, Vectors.dense(3.0)))
).toDF("label", "features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val model1 = lr.fit(training1)
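
If you want to sanity-check the fit, the trained model exposes the learned parameters (assuming Spark 1.6+, where the ml LogisticRegressionModel has coefficients and intercept):

println(s"Coefficients: ${model1.coefficients} Intercept: ${model1.intercept}")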

For more, you can check the examples at https://spark.apache.org/docs/1.6.1/ml-guide.html#dataframe (refer to the "Code examples" section).

Upvotes: 1
