Joaquín Silva
Joaquín Silva

Reputation: 27

describe() function over rows instead columns

As said in: https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html

The describe() function works for each numerical column, It is possible to do it against rows? My DF size is 53 cols and 346,143 rows, so transpose is not an option. How can I do it?

I'm using Spark 2.11

Upvotes: 0

Views: 1157

Answers (1)

Raphael Roth
Raphael Roth

Reputation: 27373

You can do your own UDF. Either you make an separate UDF for each quantity or put everything in 1 UDF returning a complex result:

val df = Seq(
  (1.0,2.0,3.0,4.0,5.0)  
).toDF("x1","x2","x3","x4","x5")


val describe = udf(
  { xs : Seq[Double] => 

    val xmin = xs.min
    val xmax = xs.max
    val mean = xs.sum/xs.size.toDouble

    (xmin,xmax,mean)
  }
)

df
.withColumn("describe",describe(array("*")))
.withColumn("min",$"describe._1")
.withColumn("max",$"describe._2")
.withColumn("mean",$"describe._3")
.drop($"describe")
.show

gives:

+---+---+---+---+---+---+---+----+
| x1| x2| x3| x4| x5|min|max|mean|
+---+---+---+---+---+---+---+----+
|1.0|2.0|3.0|4.0|5.0|1.0|5.0| 3.0|
+---+---+---+---+---+---+---+----+

Upvotes: 1

Related Questions