Reputation: 1090
Given a DataFrame "df" and a list of column names "colStr", is there a way in Spark to extract or reference those columns from the DataFrame?
Here's an example -
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}
import sqlContext.implicits._

val in = sc.parallelize(List(0, 1, 2, 3, 4, 5))
val df = in.map(x => (x, x+1, x+2)).toDF("c1", "c2", "c3")
val keyColumn = "c2" // either a single column name or column names delimited by ','
val keyGroup = keyColumn.split(",").toSeq.map(x => col(x))
val ranker = Window.partitionBy(keyGroup).orderBy($"c2")
val new_df = df.withColumn("rank", rank.over(ranker))
new_df.show()
The above errors out with:
error: overloaded method value partitionBy with alternatives
(cols:org.apache.spark.sql.Column*)org.apache.spark.sql.expressions.WindowSpec <and>
(colName: String,colNames: String*)org.apache.spark.sql.expressions.WindowSpec
cannot be applied to (Seq[org.apache.spark.sql.Column])
Appreciate the help. Thanks!
Upvotes: 1
Views: 1347
Reputation: 214957
If you are trying to partition the data frame by the columns in the keyGroup list, you can pass keyGroup: _* as the parameter to the partitionBy function:
val ranker = Window.partitionBy(keyGroup: _*).orderBy($"c2")
val new_df = df.withColumn("rank", rank.over(ranker))
new_df.show()
+---+---+---+----+
| c1| c2| c3|rank|
+---+---+---+----+
| 0| 1| 2| 1|
| 5| 6| 7| 1|
| 2| 3| 4| 1|
| 4| 5| 6| 1|
| 3| 4| 5| 1|
| 1| 2| 3| 1|
+---+---+---+----+
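The reason this works is Scala's varargs expansion: partitionBy is declared with a `Column*` parameter, so it accepts individual Column arguments but not a Seq[Column] directly; the `: _*` type ascription tells the compiler to expand the Seq into varargs. A minimal sketch of the mechanism, independent of Spark (the joinAll method here is hypothetical, just to illustrate):

```scala
// A method with a `*` parameter accepts individual arguments.
def joinAll(parts: String*): String = parts.mkString("+")

val cols = Seq("c1", "c2")

joinAll("c1", "c2") // individual arguments: "c1+c2"
joinAll(cols: _*)   // Seq expanded into varargs: also "c1+c2"
// joinAll(cols)    // does not compile: Seq[String] is not String*
```

The same rule explains the compiler error in the question: neither overload of partitionBy matches a bare Seq[org.apache.spark.sql.Column].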
Upvotes: 3