Reputation: 827
I have written the following code to feed data to a machine learning algorithm in Spark 2.3, and it runs fine. I need to extend it so that it converts not just 3 columns but any number of columns loaded from the CSV file. For instance, if I had loaded 5 columns, how could I pass them automatically to the Vectors.dense call below, or generate the same end result some other way? Does anyone know how this can be done?
val data2 = spark.read.format("csv")
  .option("header", "true")
  .load("/data/c7.csv")
val goodBadRecords = data2.map(
  row => {
    val n0 = row(0).toString.toLowerCase().toDouble
    val n1 = row(1).toString.toLowerCase().toDouble
    val n2 = row(2).toString.toLowerCase().toDouble
    val n3 = row(3).toString.toLowerCase().toDouble
    (n0, Vectors.dense(n1, n2, n3))
  }
).toDF("label", "features")
Thanks
Regards,
Adeel
Upvotes: 0
Views: 193
Reputation: 14845
A VectorAssembler can do the job:
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features [...] into a single feature vector
Based on your code, the solution would look like:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

val data2 = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true") //1
  .load("/data/c7.csv")
val fields = data2.schema.fieldNames
val assembler = new VectorAssembler()
  .setInputCols(fields.tail) //2
  .setOutputCol("features") //3
val goodBadRecords = assembler.transform(data2)
  .withColumn("label", col(fields(0))) //4
  .drop(fields: _*) //5
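Remarks:
1. inferSchema gives the columns numeric types. VectorAssembler accepts only numeric, boolean, and vector input columns, so the string columns produced by a plain CSV read have to be converted first. Any other way of casting them to doubles works as well.
2. All columns except the first are used as input for the VectorAssembler.
3. The output column of the VectorAssembler is named features.
4. A new column named label is created as a copy of the first column.
5. All original columns are dropped.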
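If you would rather stay close to your original map, the following is a minimal sketch of one way to handle any number of columns without VectorAssembler. It assumes every cell is non-null and parses as a double, and that the first column is the label. The object name RowsToLabeledVectors, the helper toLabeledPoint, and the in-memory sample rows are hypothetical stand-ins for illustration; in your code data2 would still come from the CSV read.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

object RowsToLabeledVectors {

  // Hypothetical helper: first cell becomes the label, the remaining
  // cells become the feature vector. Assumes non-null, numeric cells.
  def toLabeledPoint(cells: Seq[Any]): (Double, Vector) = {
    val values = cells.map(_.toString.toDouble)
    (values.head, Vectors.dense(values.tail.toArray))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; spark-shell already provides `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rows-to-labeled-vectors")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the CSV: all columns are strings, as they are when
    // reading with header=true and without inferSchema.
    val data2 = Seq(
      ("1", "0.5", "2.0", "3.5", "4.0"),
      ("0", "1.5", "2.5", "0.0", "1.0")
    ).toDF("c0", "c1", "c2", "c3", "c4")

    // row.toSeq exposes all cells, so the column count is not hard-coded.
    val goodBadRecords = data2
      .map(row => toLabeledPoint(row.toSeq))
      .toDF("label", "features")

    goodBadRecords.show(false)
    spark.stop()
  }
}
```

The only change from the original approach is that the cells are collected with row.toSeq and passed to the Array[Double] overload of Vectors.dense, instead of being named one by one. Compared with VectorAssembler, this version fails with an exception on a null or non-numeric cell, so it is best suited to clean input.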
Upvotes: 1