I have a dataframe:
+------------------+
| speed |
+------------------+
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 0.0|
| 3.851015222867941|
| 4.456657435740331|
| 0.0|
| NaN|
| 0.0|
| 0.0|
| NaN|
| 0.0|
| 0.0|
| 5.424094717765175|
|1.5781185921913181|
|2.6695439462433033|
| 17.43513658955467|
| 5.440912941359523|
|11.507138536880484|
|12.895677610360089|
| 9.930875909722456|
+------------------+
I want to calculate the mean and the standard deviation of the speed column. When I execute this code:
dataframe_final.select("speed").orderBy("id").agg(avg("speed")).show(1000)
I get
+------------+
|avg(speed)|
+------------+
| NaN|
+------------+
Where does the problem come from? Is there any way to solve it?
Thanks
Upvotes: 0
Views: 1047
Reputation: 92
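We can also register the dataframe as a temporary view with createOrReplaceTempView and then use Spark SQL to take the average of the speed column. The NaN rows have to be excluded with isnan, because IS NOT NULL alone does not remove them:
// createOrReplaceTempView takes the view name and returns Unit
dataframe_final.createOrReplaceTempView("tableview")
val query = "SELECT avg(speed) FROM tableview WHERE speed IS NOT NULL AND NOT isnan(speed)"
spark.sql(query).show()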
Upvotes: 1
Reputation: 163
You have NaN (Not a Number) values in your dataset. You cannot calculate an average with those: avg returns NaN as soon as one of its inputs is NaN.
Either filter them out. NaN is not null, so isNotNull does not remove it; test with isNaN instead:
dataframe_final
.filter(!$"speed".isNaN)
.select("speed")
.agg(avg("speed"))
.show(1000)
Or replace them with 0 using the fill function (note that this pulls the mean toward 0):
dataframe_final
.select("speed")
.na.fill(0)
.agg(avg("speed"))
.show(1000)
Additionally, you are trying to aggregate the Vitesse column and not the speed column.
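Since the question asks for both the mean and the standard deviation, here is a minimal sketch that computes the two in one pass after dropping the NaN rows. The sample values, the local SparkSession, and the object name SpeedStats are assumptions made for illustration; in practice dataframe_final would take the place of df. stddev here is the sample standard deviation, and both aggregates skip null values on their own, so only NaN needs explicit handling.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, isnan, stddev}

object SpeedStats {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumption, not from the question).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("speed-stats")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample standing in for dataframe_final.
    val df = Seq(0.0, 3.85, Double.NaN, 5.42, 17.43).toDF("speed")

    // Drop NaN rows first, then compute both aggregates in a single agg call.
    df.filter(!isnan(col("speed")))
      .agg(
        avg("speed").as("mean_speed"),
        stddev("speed").as("stddev_speed") // sample standard deviation
      )
      .show()

    spark.stop()
  }
}
```

If the NaN values should count as zero speed rather than be ignored, swapping the filter for .na.fill(0) gives the other variant described above.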
Upvotes: 5