user13117513

Reputation:

Calculating mean and standard deviation using Spark / Scala

I have a dataframe:

+------------------+
|         speed    |
+------------------+
|               0.0|
|               0.0|
|               0.0|
|               0.0|
|               0.0|
|               0.0|
| 3.851015222867941|
| 4.456657435740331|
|               0.0|
|               NaN|
|               0.0|
|               0.0|
|               NaN|
|               0.0|
|               0.0|
| 5.424094717765175|
|1.5781185921913181|
|2.6695439462433033|
| 17.43513658955467|
| 5.440912941359523|
|11.507138536880484|
|12.895677610360089|
| 9.930875909722456|
+------------------+

I want to calculate the mean and the standard deviation of the speed column. When I execute this code

dataframe_final.select("speed").orderBy("id").agg(avg("speed")).show(1000)

I get

+----------+
|avg(speed)|
+----------+
|       NaN|
+----------+

Where does the problem come from? Is there any way to solve it?

Thanks

Upvotes: 0

Views: 1047

Answers (2)

selvaram s

Reputation: 92

We can also call createOrReplaceTempView on dataframe_final and then use Spark SQL to compute the average of the speed column:

// createOrReplaceTempView takes the view name and returns Unit
dataframe_final.createOrReplaceTempView("tableview")
// NaN is not NULL in Spark SQL, so filter with isnan instead;
// ordering is irrelevant for a single-row aggregate, so the order by is dropped
val query = "select avg(speed) from tableview where not isnan(speed)"
spark.sql(query).show()

Upvotes: 1

nathan_gs

Reputation: 163

You have NaN (Not a Number) values in your dataset. You cannot calculate an average with those.
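The reason is that NaN propagates through arithmetic: a single NaN makes the whole sum, and hence the mean, NaN. A plain-Scala illustration:

```scala
// One NaN in the sequence poisons the sum, so the mean is NaN
val values = Seq(1.0, 2.0, Double.NaN)
val mean = values.sum / values.length
println(mean.isNaN) // true
```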

Either you filter them:

dataframe_final
  .filter(!$"speed".isNaN)  // NaN values are not null, so filter on isNaN
  .select("speed")
  .agg(avg("speed"))
  .show()

Or replace them with 0 using the na.fill function, which replaces both null and NaN values in numeric columns:

dataframe_final
  .select("speed")
  .na.fill(0)
  .agg(avg("speed"))
  .show(1000)
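The question also asks for the standard deviation, which can be computed in the same aggregation. A minimal self-contained sketch with NaN values filtered out (the toy data, column aliases, and local SparkSession setup are illustrative assumptions, not part of the original question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, stddev, col, isnan}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("speed-stats")
  .getOrCreate()
import spark.implicits._

// Toy data mirroring the question: zeros, real speeds, and a NaN gap
val dataframe_final = Seq(0.0, 3.851, Double.NaN, 5.424, 17.435).toDF("speed")

// Drop NaN rows first, then aggregate both statistics at once;
// stddev is the sample standard deviation (stddev_samp)
dataframe_final
  .filter(!isnan(col("speed")))
  .agg(avg("speed").alias("mean_speed"), stddev("speed").alias("stddev_speed"))
  .show()
```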

Additionally, you are aggregating the Vitesse column, not the speed column.

Upvotes: 5
