nmr
nmr

Reputation: 753

round to precision value based on another column pyspark

I have a data frame like below in pyspark

import pyspark.sql.functions as f

df = spark.createDataFrame(
[(123, 2897402, 43.25, 2),
(124, 2897402, 49.11, 0),
(125, 2897402, 43.25, 2), 
(126, 2897402, 48.75, 0)]
, ['model_id','lab_test_id','summary_measure_value','reading_precision'])

Expected output:

+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     123|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.11|                1|         49.1|
|     125|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+

I have tried like below

df1 = df.withColumn("reading_value", f.round(f.col("summary_measure_value"), f.col("reading_precision")))

I am getting Column is not iterable error.

How can I achieve what I want

Upvotes: 3

Views: 721

Answers (2)

Oli
Oli

Reputation: 10406

Unfortunately, the round function has the following signature:

def round(e: Column, scale: Int): Column

Therefore you can only round columns with a fixed precision determined in the driver.

To solve that, you could use a UDF, but in pyspark, they are extremely expensive. Python is very slow compared to native spark code.

Therefore, you could build a custom rounding function using round and pow like this:

# This is not a UDF, just a construction based on spark functions
def round(column, precision):
    return f.round(column * pow(10, precision)) / f.pow(10, precision)

df.withColumn("reading_value", round(f.col("summary_measure_value"), f.col("reading_precision")))

Upvotes: 2

ggordon
ggordon

Reputation: 10035

You may try using a udf that uses python's built in round function to achieve this eg:

@f.udf
def udf_round(value,precision):
    try:
        precision = int(precision)
        value = float(value)
        # use python built-in round function to round values
        return round(value,precision)
    except:
        # decide what to return when you encounter bad data
        # in this example I've returned the original value
        return value

df=df.withColumn("reading_value",udf_round( f.col("summary_measure_value"),f.col("reading_precision") ))
df.show(truncate=False)

Outputs:

+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|123     |2897402    |43.25                |2                |43.25        |
|124     |2897402    |49.25                |0                |49.0         |
|125     |2897402    |43.25                |2                |43.25        |
|126     |2897402    |48.75                |0                |49.0         |
+--------+-----------+---------------------+-----------------+-------------+

Upvotes: 1

Related Questions