Reputation: 359
I am trying to sum a list of columns in my DataFrame of type org.apache.spark.sql.DataFrame, creating a new column 'sums' and a new dataframe 'out'.
I can do this quite easily if I list the columns by hand. For example, this works:
val columnsToSum = List(col("led zeppelin"), col("lenny kravitz"), col("leona lewis"), col("lily allen"))
val out = df3.withColumn("sums", columnsToSum.reduce(_ + _))
However, if I pull the column names directly from the dataframe, the items in the list are plain strings rather than columns, and the same approach fails. For example:
val columnsToSum = df2.schema.fields.filter(f => f.dataType.isInstanceOf[StringType]).map(_.name).patch(0, Nil, 1).toList // drop the first column ("user")
println(columnsToSum)
>> List(a perfect circle, abba, ac/dc, adam green, aerosmith, afi, ...
// Trying the same method
val out = df3.withColumn("sums", columnsToSum.reduce(_ + _))
>> found   : String
   required: org.apache.spark.sql.Column
   val out = df3.withColumn("sums", columnsToSum.reduce(_ + _))
How do I do this type of conversion? I've tried:
List(a perfect circle, abba, ac/dc, ...).map(_.Column)
List(a perfect circle, abba, ac/dc, ...).map(_.spark.sql.Column)
List(a perfect circle, abba, ac/dc, ...).map(_.org.apache.spark.sql.Column)
None of these compile. How do I convert a list of column names into a list of Columns? Thanks in advance.
Upvotes: 2
Views: 354
Reputation: 22635
You can get a Column object from a string by using the function col (you are actually already using it in your first snippet).
So this should work:
columnsToSum.map(col).reduce(_ + _)
or the more verbose version:
columnsToSum.map(c => col(c)).reduce(_ + _)
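Putting it together, here is a minimal self-contained sketch. The toy dataframe and column names are made up for illustration (the original df3 and its schema are not shown in the question); it assumes a local SparkSession and uses df.columns to list the names, but any List[String] of column names works the same way:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("sum-columns")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-in for df3: a few numeric "artist" columns
val df3 = Seq((1, 2, 3), (4, 5, 6)).toDF("abba", "acdc", "afi")

// Column names as plain strings, as in the question
val columnsToSum: List[String] = df3.columns.toList

// Map each name to a Column with col, then fold with +
val out = df3.withColumn("sums", columnsToSum.map(col).reduce(_ + _))
out.show()
```

Note that `map(col)` works because col has the signature `String => Column`, so it can be passed directly where a function is expected; `map(c => col(c))` is the equivalent explicit form.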
Upvotes: 2