Reputation: 753
I have a PySpark DataFrame like the one below:
import pyspark.sql.functions as f

df = spark.createDataFrame(
    [(123, 2897402, 43.25, 2),
     (124, 2897402, 49.11, 0),
     (125, 2897402, 43.25, 2),
     (126, 2897402, 48.75, 0)],
    ['model_id', 'lab_test_id', 'summary_measure_value', 'reading_precision'])
Expected output:
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|     123|    2897402|                43.25|                2|        43.25|
|     124|    2897402|                49.11|                1|         49.1|
|     125|    2897402|                43.25|                2|        43.25|
|     126|    2897402|                48.75|                0|         49.0|
+--------+-----------+---------------------+-----------------+-------------+
I have tried the following:
df1 = df.withColumn("reading_value", f.round(f.col("summary_measure_value"), f.col("reading_precision")))
I am getting a "Column is not iterable" error.
How can I achieve what I want?
Upvotes: 3
Views: 721
Reputation: 10406
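Unfortunately, the round function has the following signature (shown here in its Scala form):
def round(e: Column, scale: Int): Column
The scale is therefore a plain integer fixed on the driver, so you can only round a column to one precision for all rows. It cannot vary per row, which is why passing a Column as the scale fails.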
To get around this you could use a UDF, but in PySpark UDFs are expensive: every row has to be serialized and handed to a Python worker, which is much slower than native Spark expressions.
Instead, you can build a custom rounding function from round and pow, like this:
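# This is not a UDF, just an expression built from native Spark functions.
# It is named round_to_precision so it does not shadow Python's built-in round.
def round_to_precision(column, precision):
    # Scale up by 10^precision, round to a whole number, then scale back down
    factor = f.pow(10, precision)
    return f.round(column * factor) / factor

df.withColumn("reading_value", round_to_precision(f.col("summary_measure_value"), f.col("reading_precision")))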
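To see why the scale, round, unscale construction gives the right numbers, here is a minimal plain-Python sketch of the same arithmetic on single values. It is only an illustration: the helper name scale_round_unscale is made up for this example, it uses the decimal module rather than Spark, and it assumes ties are rounded half up, which is what Spark's round is documented to do.

```python
from decimal import Decimal, ROUND_HALF_UP


def scale_round_unscale(value: float, precision: int) -> float:
    """Hypothetical helper: round `value` to `precision` decimal places."""
    # 10^precision, the factor used to shift the decimal point
    factor = Decimal(10) ** precision
    # Decimal(str(value)) keeps the short decimal form of the float (e.g. 43.25)
    scaled = Decimal(str(value)) * factor
    # Round the scaled value to a whole number, ties going away from zero
    rounded = scaled.quantize(Decimal(1), rounding=ROUND_HALF_UP)
    # Shift the decimal point back
    return float(rounded / factor)


# (value, precision) pairs taken from the question's expected output
rows = [(43.25, 2), (49.11, 1), (48.75, 0)]
results = [scale_round_unscale(v, p) for v, p in rows]
print(results)  # [43.25, 49.1, 49.0]
```

The Spark version does the same three steps per row, with the precision read from a column instead of a Python argument.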
Upvotes: 2
Reputation: 10035
You could use a UDF that calls Python's built-in round function, e.g.:
# Without an explicit returnType, the UDF returns a string column
@f.udf
def udf_round(value, precision):
    try:
        precision = int(precision)
        value = float(value)
        # Use Python's built-in round function to round the value
        return round(value, precision)
    except (TypeError, ValueError):
        # Decide what to return when you encounter bad data;
        # in this example the original value is returned
        return value

df = df.withColumn("reading_value", udf_round(f.col("summary_measure_value"), f.col("reading_precision")))
df.show(truncate=False)
Outputs:
+--------+-----------+---------------------+-----------------+-------------+
|model_id|lab_test_id|summary_measure_value|reading_precision|reading_value|
+--------+-----------+---------------------+-----------------+-------------+
|123     |2897402    |43.25                |2                |43.25        |
|124     |2897402    |49.11                |0                |49.0         |
|125     |2897402    |43.25                |2                |43.25        |
|126     |2897402    |48.75                |0                |49.0         |
+--------+-----------+---------------------+-----------------+-------------+
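One thing to check before relying on the UDF: Python's built-in round breaks exact ties by rounding to the nearest even digit, while Spark's round is documented as using HALF_UP, so the two can disagree on values that sit exactly halfway. A small plain-Python sketch of the difference (round_half_up is a hypothetical helper written for this example, not part of PySpark):

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value: float, precision: int) -> float:
    """Hypothetical helper: round with ties going away from zero."""
    # Smallest step at the requested precision, e.g. precision=1 -> Decimal("0.1")
    quantum = Decimal(1).scaleb(-precision)
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


# Values that are exactly representable as floats and sit exactly on a tie
ties = [(2.5, 0), (0.125, 2), (43.25, 1)]

builtin = [round(v, p) for v, p in ties]          # ties go to the even digit
half_up = [round_half_up(v, p) for v, p in ties]  # ties go away from zero

print(builtin)  # [2.0, 0.12, 43.2]
print(half_up)  # [3.0, 0.13, 43.3]
```

If half-up behaviour is needed inside the UDF, a helper along these lines could replace the call to round, assuming the inputs are ordinary finite floats.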
Upvotes: 1