user175025

Reputation: 434

How do I limit the number of digits after the decimal point in a PySpark dataframe?

How do I limit the number of digits after the decimal point? I have a PySpark dataframe.

Month       Month_start     Month_end       Result
2/1/2021    2349            456             515.131579086421
2/1/2021    3109            3500            88.8285714123568
3/1/2021    45.23           23.77           190.281868569833
3/1/2021    598412500       3796000.00      157.642913593256

I need to retain only 2 digits after the decimal point in the Result column and add a % symbol to all values in the Result column. The Month, Month_start, and Month_end columns are of string type, and Result is of double type.

Here's what I tried:

df=spark.read.csv(src_path, header=True, encoding='ISO-8859-1')

df=df.withColumn(format('Result', (col('Month_start')/col('Month_end')*100)),"%.2f").lit("%")

I'm getting this error message: AttributeError: 'DataFrame' object has no attribute 'lit'
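Note: lit (like format) is a function from pyspark.sql.functions that returns a Column, so it can't be called on a DataFrame, and withColumn expects the new column name as its first argument. A minimal sketch of what the line above seems to be aiming at (one possible fix, using format_string for the printf-style pattern) is:

from pyspark.sql.functions import col, format_string

df = df.withColumn('Result', format_string('%.2f%%', col('Month_start')/col('Month_end')*100))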

My expected dataframe is:

Month       Month_start     Month_end       Result
2/1/2021    2349            456             515.13%
2/1/2021    3109            3500            88.82%
3/1/2021    45.23           23.77           190.28%
3/1/2021    598412500       3796000.00      157.64%

Upvotes: 1

Views: 12111

Answers (2)

Mohana B C

Reputation: 5487

You can cast the column to a decimal type with a scale of 2 (two digits after the decimal point), then concat it with %.

from pyspark.sql.functions import col, concat, lit
from pyspark.sql.types import DecimalType
df = ...  # input DataFrame
df.withColumn('Result', concat(((col('Month_start')/col('Month_end'))*100).cast(DecimalType(38,2)), lit('%'))).show()

+--------+-----------+----------+---------+
|   Month|Month_start| Month_end|   Result|
+--------+-----------+----------+---------+
|2/1/2021|       2349|       456|  515.13%|
|2/1/2021|       3109|      3500|   88.83%|
|3/1/2021|      45.23|     23.77|  190.28%|
|3/1/2021|  598412500|3796000.00|15764.29%|
+--------+-----------+----------+---------+
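Note that casting to DecimalType rounds half-up, which is why 88.8285… becomes 88.83% here instead of the 88.82% shown in the expected output. If truncation rather than rounding is actually wanted, a minimal sketch (not part of the original answer, reusing the same expression) would be:

from pyspark.sql.functions import col, concat, floor, lit
from pyspark.sql.types import DecimalType

pct = (col('Month_start')/col('Month_end'))*100
# floor at 2 decimal places before casting, so 88.8285... truncates to 88.82
df.withColumn('Result', concat((floor(pct*100)/100).cast(DecimalType(38,2)), lit('%'))).show()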

Upvotes: 2

Vaebhav

Reputation: 5032

You can wrap the calculation in round() to limit the number of decimal places.

Data Preparation

input_str = """
2/1/2021    2349            456             515.131579086421
2/1/2021    3109            3500            88.8285714123568
3/1/2021    45.23           23.77           190.281868569833
3/1/2021    598412500       3796000.00      157.642913593256
""".split()

input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))

cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "Month       Month_start     Month_end       Result".split()))
            
n = len(input_values)

input_list = [tuple(input_values[i:i+4]) for i in range(0,n,4)]

sparkDF = sql.createDataFrame(input_list, cols)

sparkDF = sparkDF.withColumn('Month_start',F.col('Month_start').cast(DoubleType()))\
                 .withColumn('Month_end',F.col('Month_end').cast(DoubleType()))

sparkDF.show()

+--------+-----------+---------+----------------+
|   Month|Month_start|Month_end|          Result|
+--------+-----------+---------+----------------+
|2/1/2021|     2349.0|    456.0|515.131579086421|
|2/1/2021|     3109.0|   3500.0|88.8285714123568|
|3/1/2021|      45.23|    23.77|190.281868569833|
|3/1/2021| 5.984125E8|3796000.0|157.642913593256|
+--------+-----------+---------+----------------+

Round

import pyspark.sql.functions as F

sparkDF = sparkDF.withColumn('rounded_result',F.round((F.col('Month_start')/F.col('Month_end'))*100,2))

sparkDF = sparkDF.withColumn('rounded_result',F.concat(F.col('rounded_result').cast(StringType()),F.lit('%')))

sparkDF.show()

+--------+-----------+---------+----------------+--------------+
|   Month|Month_start|Month_end|          Result|rounded_result|
+--------+-----------+---------+----------------+--------------+
|2/1/2021|     2349.0|    456.0|515.131579086421|       515.13%|
|2/1/2021|     3109.0|   3500.0|88.8285714123568|        88.83%|
|3/1/2021|      45.23|    23.77|190.281868569833|       190.28%|
|3/1/2021| 5.984125E8|3796000.0|157.642913593256|     15764.29%|
+--------+-----------+---------+----------------+--------------+

sparkDF.printSchema()

root
 |-- Month: string (nullable = true)
 |-- Month_start: double (nullable = true)
 |-- Month_end: double (nullable = true)
 |-- Result: string (nullable = true)
 |-- rounded_result: string (nullable = true)
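As a side note, the printf-style formatting the question attempted can be done in one step with format_string, which rounds to two decimal places and appends the % in the same call (a sketch reusing the same ratio):

# format_string applies a Java printf pattern; %% is a literal percent sign
sparkDF = sparkDF.withColumn('rounded_result', F.format_string('%.2f%%', (F.col('Month_start')/F.col('Month_end'))*100))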

Upvotes: 4
