Reputation: 27
As said in: https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
The describe()
function works for each numerical column, It is possible to do it against rows? My DF size is 53
cols and 346,143
rows, so transpose is not an option. How can I do it?
I'm using Spark 2.11
Upvotes: 0
Views: 1157
Reputation: 27373
You can do your own UDF. Either you make an separate UDF for each quantity or put everything in 1 UDF returning a complex result:
val df = Seq(
(1.0,2.0,3.0,4.0,5.0)
).toDF("x1","x2","x3","x4","x5")
val describe = udf(
{ xs : Seq[Double] =>
val xmin = xs.min
val xmax = xs.max
val mean = xs.sum/xs.size.toDouble
(xmin,xmax,mean)
}
)
df
.withColumn("describe",describe(array("*")))
.withColumn("min",$"describe._1")
.withColumn("max",$"describe._2")
.withColumn("mean",$"describe._3")
.drop($"describe")
.show
gives:
+---+---+---+---+---+---+---+----+
| x1| x2| x3| x4| x5|min|max|mean|
+---+---+---+---+---+---+---+----+
|1.0|2.0|3.0|4.0|5.0|1.0|5.0| 3.0|
+---+---+---+---+---+---+---+----+
Upvotes: 1