John
John

Reputation: 1591

Column wise operation in Spark

I have a dataframe like below:

 Id  priority1 priority2 priority3
  1    true      true     true
  2    false     true     true
  3    false     false    false

I need to create a new dataframe, where if first priority is true (order of priority priority1, priority2, priority3), I need to make all other priorities as false. Looking for a dynamic solution where I can add lot more priority columns.

Expected output from the example above:

Id  priority1 priority2 priority3   new_priority1 new_priority2 new_priority3
1    true      true     true         true           false         false
2    false     true     true         false          true          false
3    false     false    true         false          false         true

Upvotes: 1

Views: 445

Answers (1)

Shaido
Shaido

Reputation: 28322

First, put all the priority columns in an array and convert it to contain a single true value using an UDF. Then to get the array values into their own columns, use a foldLeft. Using the example input:

val df = Seq((1, true, true, true), (2, false, true, true), (3, false, false, false))
  .toDF("Id", "priority1", "priority2", "priority3")

The UDF and it's usage:

val convertPriorities = udf((prios: Seq[Boolean]) => {
  val falseSeq = Seq.fill(prios.length)(false)
  prios.indexOf(true) match {
    case -1 => falseSeq
    case x => falseSeq.updated(x, true)
  }
})

val prioColumns = Seq("priority1", "priority2", "priority3")
val df2 = df.withColumn("priorities", convertPriorities(array(prioColumns.map(col(_)):_*)))

Note that a prioColumns variable is created to make the foldLeft easier.

val df3 = prioColumns.zipWithIndex
  .foldLeft(df2)((df, col) => df.withColumn("new_" + col._1, $"priorities"(col._2)))
  .drop("priorities")

This will give the following resulting dataframe:

+---+---------+---------+---------+-------------+-------------+-------------+
| Id|priority1|priority2|priority3|new_priority1|new_priority2|new_priority3|
+---+---------+---------+---------+-------------+-------------+-------------+
|  1|     true|     true|     true|         true|        false|        false|
|  2|    false|     true|     true|        false|         true|        false|
|  3|    false|    false|    false|        false|        false|        false|
+---+---------+---------+---------+-------------+-------------+-------------+

Using this approach it should be very easy to extend to use more columns, the only change that needs to be done is to add the columns to the prioColumns variable. This could be done by getting all column name and applying a filter if (as in the example above) the columns have similar names.

Upvotes: 2

Related Questions