Converting a Spark DataFrame for ML processing

Question

I have written the following code to feed data to a machine learning algorithm in Spark 2.3, and it runs fine. I need to extend it so that it converts not just 3 columns but any number of columns loaded from the CSV file. For instance, if I had loaded 5 columns, how could I pass them automatically to the Vectors.dense call below, or generate the same end result some other way? Does anyone know how this can be done?

val data2 = spark.read.format("csv")
  .option("header", "true")
  .load("/data/c7.csv")

val goodBadRecords = data2.map(
  row => {
    val n0 = row(0).toString.toLowerCase().toDouble
    val n1 = row(1).toString.toLowerCase().toDouble
    val n2 = row(2).toString.toLowerCase().toDouble
    val n3 = row(3).toString.toLowerCase().toDouble
    (n0, Vectors.dense(n1, n2, n3))
  }
).toDF("label", "features")

Thanks

Regards,

Adeel

werner · Accepted Answer

A VectorAssembler can do the job:

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features [...] into a single feature vector
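
To see what the assembler does in isolation, here is a minimal sketch on made-up data. The toy DataFrame, its column names (f1, f2, f3), and the local SparkSession settings are assumptions for illustration only and are not taken from the question's CSV file.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("assembler-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical toy data: a label column followed by three numeric feature columns.
val toy = Seq(
  (1.0, 0.5, 2.0, 3.0),
  (0.0, 1.5, 4.0, 6.0)
).toDF("label", "f1", "f2", "f3")

// Combine the three feature columns into a single vector column named "features".
val toyAssembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

// Keep only the two columns a learning algorithm typically reads.
val assembled = toyAssembler.transform(toy).select("label", "features")
assembled.show(false)
```

Because the input columns are passed as an array of names, the same call works for 3, 5, or any other number of feature columns.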

Based on your code, the solution would look like:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val data2 = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true") //1
  .load("/data/c7.csv")

val fields = data2.schema.fieldNames

val assembler = new VectorAssembler()
  .setInputCols(fields.tail) //2
  .setOutputCol("features") //3

val goodBadRecords = assembler.transform(data2)
  .withColumn("label", col(fields(0))) //4
  .drop(fields:_*) //5

Remarks:

1. A schema is necessary for the input data, as the VectorAssembler only accepts the following input column types: all numeric types, boolean type, and vector type (according to the same documentation). You seem to have a CSV file with doubles, so inferring the schema should work. Of course, any other method of converting the string data to doubles is also fine.
2. Use all but the first column as input for the VectorAssembler.
3. Name the result column of the VectorAssembler "features".
4. Create a new column called "label" as a copy of the first column.
5. Drop all original columns. This last step is optional, as the learning algorithm usually only looks at the label and features columns and ignores all other columns.
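
As one example of the "any other method" mentioned in the first remark, the columns can be cast to double explicitly instead of relying on inferSchema. This is only a sketch under stated assumptions: the in-memory DataFrame named stringly is a hypothetical stand-in for a CSV file read without inferSchema (where every column arrives as a string), and the column names c0 to c3 are made up. It also assumes the column names contain no dots and that no input column is already called "features".

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Local session for illustration; in spark-shell a `spark` session already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for a CSV read without inferSchema: all columns are strings.
val stringly = Seq(
  ("1", "0.5", "2.0", "3.0"),
  ("0", "1.5", "4.0", "6.0")
).toDF("c0", "c1", "c2", "c3")

// Cast every column to double while keeping its name.
// Values that cannot be parsed as numbers become null.
val numeric = stringly.select(
  stringly.columns.map(c => col(c).cast(DoubleType).as(c)): _*
)

// First column is the label, all remaining columns are features.
val labelCol = numeric.columns.head
val featureCols = numeric.columns.tail

val result = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")
  .transform(numeric)
  .select(col(labelCol).as("label"), col("features"))
```

Note that a value which fails the cast turns into null, and as far as I know the VectorAssembler raises an error on null inputs by default, so dirty rows should be filtered or filled before assembling.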
