schoon

Reputation: 3324

How to keep record information when working in ML

I am basing this question on this one. The OP says 'This problem doesn't exist in ML as it uses DataFrame and I can simply add another column with the score to my original dataframe.' Can anyone tell me how to do this? I have tried:

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.SparseVector
import spark.implicits._ // for .toDS, assuming a SparkSession named spark

val labeledData = data1.select("labels", "hash-tfidf").rdd.map { row =>
  LabeledPoint(row.getAs[Double]("labels"), row.getAs[SparseVector]("hash-tfidf"))
}

val scoreDF = model.transform(labeledData.toDS)

val dfPredictions = data1.withColumn("prediction", scoreDF.col("prediction"))

where data1 is my original DataFrame with lots of columns. This fails with:

org.apache.spark.sql.AnalysisException: resolved attribute(s) prediction#1458 missing from ....[loads of fields I think from data1]...

What am I doing wrong?

Upvotes: 0

Views: 23

Answers (1)

Alper t. Turker

Reputation: 35229

You don't need RDDs or LabeledPoint here, and you cannot add a column from another DataFrame.

It is not clear what model is, but I assume its input column is features, so you can either rename the column:

model.transform(data1.withColumnRenamed("hash-tfidf", "features"))

or configure the model to accept hash-tfidf as its input column.
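
As a minimal sketch of the second option, assuming model comes from a LogisticRegression estimator (the actual model type is not shown in the question; any ml estimator with a setFeaturesCol setter works the same way):

import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical estimator standing in for whatever produced model.
val lr = new LogisticRegression()
  .setFeaturesCol("hash-tfidf")   // read features from the existing column
  .setLabelCol("labels")

val lrModel = lr.fit(data1)

// transform returns data1 with the prediction columns appended,
// so all original columns are kept and no withColumn is needed.
val dfPredictions = lrModel.transform(data1)

Either way, transform keeps every column of data1 and simply appends the prediction columns, which is exactly the behaviour the question is after.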

Upvotes: 1
