Ivan

Reputation: 703

How to round decimal in Scala Spark

I have a (large ~ 1million) Scala Spark DataFrame with the following data:

id,score
1,0.956
2,0.977
3,0.855
4,0.866
...

How do I discretise/round the scores to the nearest 0.05 decimal place?

Expected result:

id,score
1,0.95
2,1.00
3,0.85
4,0.85
...

I would like to avoid using a UDF to maximise performance.

Upvotes: 6

Views: 54644

Answers (3)

irisha_murrr

Reputation: 395

The answer can be simpler:

dataframe.withColumn("rounded_score", round(col("score"), 2))

There is a built-in method:

def round(e: Column, scale: Int)

which rounds the value of e to scale decimal places using the HALF_UP rounding mode.
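The HALF_UP behaviour can be sketched in plain Scala with BigDecimal (no Spark needed), which uses the same rounding modes:

```scala
import scala.math.BigDecimal.RoundingMode

// HALF_UP rounds the last kept digit up when the next digit is 5 or greater
val r1 = BigDecimal("0.956").setScale(2, RoundingMode.HALF_UP) // 0.96
val r2 = BigDecimal("0.955").setScale(2, RoundingMode.HALF_UP) // 0.96
val r3 = BigDecimal("0.954").setScale(2, RoundingMode.HALF_UP) // 0.95
println(r1, r2, r3)
```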

Upvotes: 18

vaquar khan

Reputation: 11489

You can specify your schema when converting into a DataFrame.

Example:

Use DecimalType(10, 2) for the column in your customSchema when loading the data.

id,score
1,0.956
2,0.977
3,0.855
4,0.866
...

import org.apache.spark.sql.types._

val mySchema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("score", DecimalType(10, 2), true)
))

spark.read.format("csv").schema(mySchema)
  .option("header", "true").option("nullValue", "?")
  .load("/path/to/csvfile").show

Upvotes: 1

soote

Reputation: 3260

You can do it using Spark built-in functions, like so:

dataframe.withColumn("rounded_score", round(col("score") * 100 / 5) * 5 / 100)
  1. Multiply the score by 100 so that the precision you want (0.05) becomes a whole number (5).
  2. Divide that number by 5 and round to the nearest integer.
  3. Multiply by 5 to get back a number divisible by 5.
  4. Divide by 100 to restore the original scale.
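The same arithmetic can be checked in plain Scala (outside Spark); the column expression above simply applies it row-wise:

```scala
// Round to the nearest 0.05 using the multiply/round/divide trick
// (a plain-Scala sketch of the column expression above)
def roundToNearest005(x: Double): Double =
  math.round(x * 100 / 5) * 5 / 100.0

println(roundToNearest005(0.956)) // 0.95
println(roundToNearest005(0.977)) // 1.0
println(roundToNearest005(0.866)) // 0.85
```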

result

+---+-----+-------------+
| id|score|rounded_score|
+---+-----+-------------+
|  1|0.956|         0.95|
|  2|0.977|          1.0|
|  3|0.855|         0.85|
|  4|0.866|         0.85|
+---+-----+-------------+

Upvotes: 11
