Reputation: 703
I have a large (~1 million rows) Scala Spark DataFrame with the following data:
id,score
1,0.956
2,0.977
3,0.855
4,0.866
...
How do I discretise/round the scores to the nearest 0.05?
Expected result:
id,score
1,0.95
2,1.00
3,0.85
4,0.85
...
I would like to avoid using a UDF to maximise performance.
Upvotes: 6
Views: 54644
Reputation: 395
The answer can be simpler:
dataframe.withColumn("rounded_score", round(col("score"), 2))
There is a method
def round(e: Column, scale: Int)
which rounds the value of e to scale decimal places with HALF_UP round mode.
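For reference, a minimal runnable version of this (the DataFrame name dataframe and the score column are assumed from the question; round comes from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.{col, round}

// rounds "score" to 2 decimal places with HALF_UP
val rounded = dataframe.withColumn("rounded_score", round(col("score"), 2))
rounded.show()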
Upvotes: 18
Reputation: 11489
You can specify your schema when converting into a DataFrame.
Example:
Use DecimalType(10, 2) for the score column in your custom schema when loading the data.
id,score
1,0.956
2,0.977
3,0.855
4,0.866
...
import org.apache.spark.sql.types._
val mySchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("score", DecimalType(10, 2), true)
))
spark.read.format("csv").schema(mySchema).
  option("header", "true").option("nullValue", "?").
  load("/path/to/csvfile").show
Upvotes: 1
Reputation: 3260
You can do it using Spark built-in functions, like so:
dataframe.withColumn("rounded_score", round(col("score") * 100 / 5) * 5 / 100)
Result:
+---+-----+-------------+
| id|score|rounded_score|
+---+-----+-------------+
| 1|0.956| 0.95|
| 2|0.977| 1.0|
| 3|0.855| 0.85|
| 4|0.866| 0.85|
+---+-----+-------------+
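End to end, a self-contained sketch of the same multiply-round-divide idea (toy data and column names taken from the question; spark is assumed to be the active SparkSession):
import org.apache.spark.sql.functions.{col, round}
import spark.implicits._

val df = Seq((1, 0.956), (2, 0.977), (3, 0.855), (4, 0.866)).toDF("id", "score")

// scale by 20 (* 100 / 5) so 0.05 steps become whole numbers, round, then scale back
val result = df.withColumn("rounded_score", round(col("score") * 100 / 5) * 5 / 100)
result.show()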
Upvotes: 11