Greg

Reputation: 39

How to define a features column in Spark ML

I am trying to run the Spark logistic regression function (ml, not mllib). I have a DataFrame which looks like this (only the first row shown):

+-----+--------+
|label|features|
+-----+--------+
|  0.0|  [60.0]|

(Right now I'm just trying to keep it simple with only one dimension in the feature vector, but I will expand later on.)

I run the following code, taken from the Spark ML documentation:

import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(df)

This gives me the error:

org.apache.spark.SparkException: Values to assemble cannot be null.

I'm not sure how to fix this error. I looked at sample_libsvm_data.txt, which is in the Spark GitHub repo and is used in some of the examples in the Spark ML documentation. That DataFrame looks like this:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(692,[127,128,129...|
|  1.0|(692,[158,159,160...|
|  1.0|(692,[124,125,126...|

Based on this example, my data looks like it is in the right format, with one issue. Is 692 the number of features? That seems rather dumb if so; Spark should just be able to look at the length of the feature vector to see how many features there are. If I do need to add the number of features, how would I do that? (I'm pretty new to Scala/Java.)

Cheers

Upvotes: 1

Views: 2104

Answers (1)

Manish Mishra

Reputation: 848

  1. This error is thrown by VectorAssembler when any of the features is null. Please verify that your rows don't contain null values. If there are null values, you must convert them to a default numeric value before running the VectorAssembler (see the first sketch after this list).

  2. Regarding the format of sample_libsvm_data.txt: it is stored in LIBSVM (sparse) form, where each line looks like 0 128:51 129:159 130:253. The leading 0 is the label, and each subsequent entry is in index:value form. The 692 in your printout is the size of the sparse vector, i.e. the total number of features; it must be given explicitly because a sparse vector only lists its non-zero entries (see the second sketch after this list).
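
A minimal sketch of the null-handling in point 1, assuming a hypothetical raw DataFrame with a nullable numeric column "age" (the column name and the default fill value are illustrative):

import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical input: the second row has a null feature value
val raw = sqlContext.createDataFrame(Seq(
  (0.0, Some(60.0)),
  (1.0, None: Option[Double])
)).toDF("label", "age")

// Replace nulls with a default before assembling; otherwise
// VectorAssembler throws "Values to assemble cannot be null."
val cleaned = raw.na.fill(0.0, Seq("age"))

val assembler = new VectorAssembler()
  .setInputCols(Array("age"))
  .setOutputCol("features")

val assembled = assembler.transform(cleaned)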
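
And a second sketch for point 2, showing why the 692 must be given explicitly: a sparse vector is built as (size, indices, values), so Spark cannot infer the total number of features from the non-zero entries alone (the index/value numbers below are illustrative):

import org.apache.spark.mllib.linalg.Vectors

// Sparse form, like the (692,[127,128,...],[...]) rows in sample_libsvm_data.txt:
// 692 is the vector size, i.e. the total number of features
val sparseVec = Vectors.sparse(692, Array(127, 128, 129), Array(51.0, 159.0, 253.0))

// A dense vector stores every entry, so its size is just its length
val denseVec = Vectors.dense(60.0)

println(sparseVec.size) // 692
println(denseVec.size)  // 1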

You can build your single-feature DataFrame with the Vectors class as follows (I ran this on a 1.6.1 shell):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.classification.LogisticRegression

// Each row is (label, feature vector); Vectors.dense builds the features column
val training1 = sqlContext.createDataFrame(Seq(
  (1.0, Vectors.dense(3.0)),
  (0.0, Vectors.dense(3.0)))
).toDF("label", "features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val model1 = lr.fit(training1)
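
If you want to sanity-check the fit, the trained model exposes the learned parameters (assuming Spark 1.6+, where the ml LogisticRegressionModel has coefficients and intercept):

println(s"Coefficients: ${model1.coefficients} Intercept: ${model1.intercept}")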

For more, you can check the examples at https://spark.apache.org/docs/1.6.1/ml-guide.html#dataframe (refer to the "Code examples" section).

Upvotes: 1
