describe() function over rows instead columns

Question

As said in: https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

The describe() function works for each numerical column, It is possible to do it against rows? My DF size is 53 cols and 346,143 rows, so transpose is not an option. How can I do it?

I'm using Spark 2.11

Raphael Roth · Accepted Answer

You can do your own UDF. Either you make an separate UDF for each quantity or put everything in 1 UDF returning a complex result:

val df = Seq(
  (1.0,2.0,3.0,4.0,5.0)  
).toDF("x1","x2","x3","x4","x5")


val describe = udf(
  { xs : Seq[Double] => 

    val xmin = xs.min
    val xmax = xs.max
    val mean = xs.sum/xs.size.toDouble

    (xmin,xmax,mean)
  }
)

df
.withColumn("describe",describe(array("*")))
.withColumn("min",$"describe._1")
.withColumn("max",$"describe._2")
.withColumn("mean",$"describe._3")
.drop($"describe")
.show

gives:

+---+---+---+---+---+---+---+----+
| x1| x2| x3| x4| x5|min|max|mean|
+---+---+---+---+---+---+---+----+
|1.0|2.0|3.0|4.0|5.0|1.0|5.0| 3.0|
+---+---+---+---+---+---+---+----+

describe() function over rows instead columns

Answers (1)

Related Questions