Reputation: 165
I am trying to convert the following SQL query into PySpark:
SELECT COUNT( CASE WHEN COALESCE(data.pred,0) != 0 AND COALESCE(data.val,0) != 0
AND (ABS(COALESCE(data.pred,0) - COALESCE(data.val,0)) / COALESCE(data.val,0)) > 0.1
THEN data.pred END) / COUNT(*) AS Result
The code I have in PySpark right now is this:
Result = data.select(
    count(
        (coalesce(data["pred"], lit(0)) != 0) &
        (coalesce(data["val"], lit(0)) != 0) &
        (abs(
            coalesce(data["pred"], lit(0)) -
            coalesce(data["val"], lit(0))
        ) / coalesce(data["val"], lit(0)) > 0.1
        )
    )
)
aux_2 = aux_1.select(aux_1.column_name.cast("float"))
aux_3 = aux_2.head()[0]
Deviation = (aux_3 / data.count())*100
However, this simply returns the number of rows in the "data" DataFrame, which I know isn't correct. I am very new to PySpark; can anyone help me solve this?
Upvotes: 0
Views: 796
Reputation: 42342
You need to collect the result into an integer and then divide the numbers in Python. Your original attempt returns the total row count because count() counts non-null values, and the boolean condition (after coalesce) is never null, so every row is counted:
from pyspark.sql.functions import abs, coalesce, lit

# Keep only the rows that satisfy the condition, count them (a Python int),
# then do the division in Python
Result = data.filter(
    (coalesce(data["pred"], lit(0)) != 0) &
    (coalesce(data["val"], lit(0)) != 0) &
    (abs(
        coalesce(data["pred"], lit(0)) -
        coalesce(data["val"], lit(0))
    ) / coalesce(data["val"], lit(0)) > 0.1)
).count() / data.count()
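If you want to stay closer to the original SQL and compute the ratio in a single aggregation (one pass over the data instead of two separate counts), a conditional count with when should also work. This is only a sketch, assuming the same "data" DataFrame with pred and val columns:

from pyspark.sql import functions as F

# F.when(...) returns NULL when the condition is false, and count() ignores NULLs,
# so this mirrors COUNT(CASE WHEN ... THEN pred END) / COUNT(*)
cond = (
    (F.coalesce(F.col("pred"), F.lit(0)) != 0) &
    (F.coalesce(F.col("val"), F.lit(0)) != 0) &
    (F.abs(F.coalesce(F.col("pred"), F.lit(0)) - F.coalesce(F.col("val"), F.lit(0)))
     / F.coalesce(F.col("val"), F.lit(0)) > 0.1)
)

Result = data.select(
    (F.count(F.when(cond, F.col("pred"))) / F.count(F.lit(1))).alias("Result")
).head()[0]

Deviation = Result * 100  # same percentage the question computes at the end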
Upvotes: 1