round to precision value based on another column pyspark

Question

I have a data frame like below in pyspark

import pyspark.sql.functions as f

df = spark.createDataFrame(
[(123, 2897402, 43.25, 2),
(124, 2897402, 49.11, 0),
(125, 2897402, 43.25, 2), 
(126, 2897402, 48.75, 0)]
, ['model_id','lab_test_id','summary_measure_value','reading_precision'])

Expected output:

+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     123|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.11|                1|         49.1|
|     125|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+

I have tried like below

df1 = df.withColumn("reading_value", f.round(f.col("summary_measure_value"), f.col("reading_precision")))

I am getting Column is not iterable error.

How can I achieve what I want

Oli · Accepted Answer

Unfortunately, the round function has the following signature:

def round(e: Column, scale: Int): Column

Therefore you can only round columns with a fixed precision determined in the driver.

To solve that, you could use a UDF, but in pyspark, they are extremely expensive. Python is very slow compared to native spark code.

Therefore, you could build a custom rounding function using round and pow like this:

# This is not a UDF, just a construction based on spark functions
def round(column, precision):
    return f.round(column * pow(10, precision)) / f.pow(10, precision)

df.withColumn("reading_value", round(f.col("summary_measure_value"), f.col("reading_precision")))

round to precision value based on another column pyspark

Answers (2)

Related Questions