Reputation: 3473
I just started learning Scala and I'm trying to figure out a way to get the min of two or more Columns of the same type in a DataFrame. I have the following code, which gives me the min and max of a Column individually.
inputDF.select(min($"dropoff_longitude")).show
inputDF.select(max($"pickup_longitude")).show
How do I get the min of both Columns, dropoff_longitude and pickup_longitude? I did it like this:
scala.math.min(
  inputDF.select(min($"pickup_longitude")).head.getFloat(0),
  inputDF.select(min($"dropoff_longitude")).head.getFloat(0)
)
Is there a better way to do this?
Thank you
Upvotes: 4
Views: 18121
Reputation: 6164
You can use the least and greatest Spark SQL functions in select expressions for this purpose. In your case it will look like this:
import org.apache.spark.sql.functions._
val minLongitude =
  df.select(least($"pickup_longitude", $"dropoff_longitude") as "least_longitude")
    .agg(min($"least_longitude"))
    .head.getFloat(0)
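The same pattern works in the other direction with greatest; a minimal sketch, assuming the same Float longitude columns:

// analogous sketch for the overall maximum across both columns
val maxLongitude =
  df.select(greatest($"pickup_longitude", $"dropoff_longitude") as "greatest_longitude")
    .agg(max($"greatest_longitude"))
    .head.getFloat(0)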
Upvotes: 13
Reputation: 131
In many cases, it might be more efficient to avoid collect, so that the Spark query plan is executed only once (hence fewer data reads and, possibly, less shuffling). In this case, you can use e.g.:
import org.apache.spark.sql.functions._

// inputCol1 and inputCol2 hold the two column names as Strings
val minValDf = inputDf.agg(min(col(inputCol1)).as(inputCol1), min(col(inputCol2)).as(inputCol2))
val minDf = minValDf.withColumn("minCol",
  when(col(inputCol1).gt(col(inputCol2)), col(inputCol2))
    .otherwise(col(inputCol1)))
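If you eventually need the value on the driver, you can collect just that single aggregated row at the end; a sketch, assuming the columns are Float as in the question:

// one collect at the very end, so the whole plan runs once
val overallMin = minDf.head.getAs[Float]("minCol")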
Upvotes: 4
Reputation: 214987
I think you can do something like this:
// assume all columns are of type Int; they could be of another numeric type as well
val df = Seq((1,2),(3,4)).toDF("A", "B")
df.agg(min($"A"), min($"B")).first().toSeq.map{ case x: Int => x }.min
// res14: Int = 1
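If you need this across more than two columns, the same idea generalizes; a sketch, assuming every column of the DataFrame is Int:

// build one min(...) aggregate per column and run them in a single agg
val mins = df.columns.map(c => min(col(c)))
df.agg(mins.head, mins.tail: _*).first().toSeq.map { case x: Int => x }.min
// res: Int = 1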
Upvotes: 1