Regazzi

Reputation: 115

PySpark: handling numbers with a comma as decimal separator

I have the following column that needs to be transformed into a decimal. Currently the column ent_Rentabiliteit_ent_rentabiliteit is a string, and I need to convert it to a data type that returns the same values. After the transformation, values such as -0.7 or -1.2 must be shown.

My current code looks like this:

from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

# Replace the decimal comma with a period, then cast to decimal.
df2test = df2.withColumn(
    "ent_Rentabiliteit_ent_rentabiliteit",
    F.regexp_replace("ent_Rentabiliteit_ent_rentabiliteit", ",", ".").cast(
        DecimalType()
    ),
)

Any suggestions? Many thanks.

Upvotes: 1

Views: 1300

Answers (1)

vladsiv

Reputation: 2936

You could just cast the column to DecimalType.

Example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType

spark = SparkSession.builder.getOrCreate()

data = [
    {"ent_Rentabiliteit_ent_rentabiliteit": "1.23"},
    {"ent_Rentabiliteit_ent_rentabiliteit": "-2.33"},
]
df = spark.createDataFrame(data)

# Cast the string column to a decimal with explicit precision and scale.
df = df.withColumn(
    "new_col", F.col("ent_Rentabiliteit_ent_rentabiliteit").cast(DecimalType(10, 4))
)

Result:

+-----------------------------------+-------+
|ent_Rentabiliteit_ent_rentabiliteit|new_col|
+-----------------------------------+-------+
|                               1.23| 1.2300|
|                              -2.33|-2.3300|
+-----------------------------------+-------+

Schema:

root
 |-- ent_Rentabiliteit_ent_rentabiliteit: string (nullable = true)
 |-- new_col: decimal(10,4) (nullable = true)
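
If your raw strings use a comma as the decimal separator (as the question title suggests, e.g. -0,7), replace the comma with a period before casting. A minimal sketch, assuming comma-formatted input values:

df_comma = spark.createDataFrame(
    [("-0,7",), ("-1,2",)], ["ent_Rentabiliteit_ent_rentabiliteit"]
)
# Swap the comma for a period, then cast as above.
df_comma = df_comma.withColumn(
    "new_col",
    F.regexp_replace("ent_Rentabiliteit_ent_rentabiliteit", ",", ".").cast(
        DecimalType(10, 4)
    ),
)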

Please keep in mind that DecimalType must have a fixed precision. When you create a DecimalType, the default precision and scale are (10, 0). When a schema is inferred from decimal.Decimal objects, it will be DecimalType(38, 18).
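
For instance, reusing the df from the example above, casting with the bare DecimalType() produces decimal(10, 0); a scale of 0 means the fractional digits are lost (a minimal sketch):

# Default DecimalType() is decimal(10, 0): scale 0, no fractional digits.
df.withColumn(
    "default_cast",
    F.col("ent_Rentabiliteit_ent_rentabiliteit").cast(DecimalType()),
).show()
# 1.23 becomes 1 and -2.33 becomes -2.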

Please read more about precision here: DecimalType

Upvotes: 2
