Reputation: 434
How do I limit the number of digits after the decimal point? I have a PySpark dataframe:
Month     Month_start  Month_end   Result
2/1/2021  2349         456         515.131579086421
2/1/2021  3109         3500        88.8285714123568
3/1/2021  45.23        23.77       190.281868569833
3/1/2021  598412500    3796000.00  157.642913593256
I need to retain only 2 digits after the decimal point in the Result column and append a % symbol to every value in that column. The Month, Month_start, and Month_end columns are of string type and Result is of double type.
Here's what I tried:
df=spark.read.csv(src_path, header=True, encoding='ISO-8859-1')
df=df.withColumn(format('Result', (col('Month_start')/col('Month_end')*100)),"%.2f").lit("%")
I'm getting this error message:
AttributeError: 'DataFrame' object has no attribute 'lit'
My expected dataframe is:
Month     Month_start  Month_end   Result
2/1/2021  2349         456         515.13%
2/1/2021  3109         3500        88.82%
3/1/2021  45.23        23.77       190.28%
3/1/2021  598412500    3796000.00  157.64%
Upvotes: 1
Views: 12111
Reputation: 5487
You can cast the column to a decimal type with a scale of 2, then concat it with %.
from pyspark.sql.functions import col, concat, lit
from pyspark.sql.types import DecimalType

df = ...  # input dataframe
df.withColumn('Result', concat(((col('Month_start')/col('Month_end'))*100).cast(DecimalType(38,2)), lit('%'))).show()
+--------+-----------+----------+---------+
| Month|Month_start| Month_end| Result|
+--------+-----------+----------+---------+
|2/1/2021| 2349| 456| 515.13%|
|2/1/2021| 3109| 3500| 88.83%|
|3/1/2021| 45.23| 23.77| 190.28%|
|3/1/2021| 598412500|3796000.00|15764.29%|
+--------+-----------+----------+---------+
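If you prefer a string-formatting approach, here is a minimal sketch of the same idea using format_number, assuming the same input dataframe df. Note that format_number rounds to the given number of decimals but also inserts thousands separators, so the last row would come out as 15,764.29%:
from pyspark.sql.functions import col, concat, format_number, lit

# format_number(..., 2) rounds to 2 decimals and returns a string;
# it also adds thousands separators (e.g. 15,764.29), which may or may not be wanted
df.withColumn('Result',
              concat(format_number((col('Month_start') / col('Month_end')) * 100, 2), lit('%'))).show()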
Upvotes: 2
Reputation: 5032
You can wrap the calculation in F.round to limit the number of decimal places.
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, StringType

input_str = """
2/1/2021 2349 456 515.131579086421
2/1/2021 3109 3500 88.8285714123568
3/1/2021 45.23 23.77 190.281868569833
3/1/2021 598412500 3796000.00 157.642913593256
""".split()
input_values = list(map(lambda x: x.strip() if x.strip() != 'null' else None, input_str))
cols = list(map(lambda x: x.strip() if x.strip() != 'null' else None, "Month Month_start Month_end Result".split()))
n = len(input_values)
input_list = [tuple(input_values[i:i+4]) for i in range(0,n,4)]
sparkDF = spark.createDataFrame(input_list, cols)  # spark: an existing SparkSession
sparkDF = sparkDF.withColumn('Month_start',F.col('Month_start').cast(DoubleType()))\
.withColumn('Month_end',F.col('Month_end').cast(DoubleType()))
sparkDF.show()
+--------+-----------+---------+----------------+
| Month|Month_start|Month_end| Result|
+--------+-----------+---------+----------------+
|2/1/2021| 2349.0| 456.0|515.131579086421|
|2/1/2021| 3109.0| 3500.0|88.8285714123568|
|3/1/2021| 45.23| 23.77|190.281868569833|
|3/1/2021| 5.984125E8|3796000.0|157.642913593256|
+--------+-----------+---------+----------------+
import pyspark.sql.functions as F
sparkDF = sparkDF.withColumn('rounded_result',F.round((F.col('Month_start')/F.col('Month_end'))*100,2))
sparkDF = sparkDF.withColumn('rounded_result',F.concat(F.col('rounded_result').cast(StringType()),F.lit('%')))
sparkDF.show()
+--------+-----------+---------+----------------+--------------+
| Month|Month_start|Month_end| Result|rounded_result|
+--------+-----------+---------+----------------+--------------+
|2/1/2021| 2349.0| 456.0|515.131579086421| 515.13%|
|2/1/2021| 3109.0| 3500.0|88.8285714123568| 88.83%|
|3/1/2021| 45.23| 23.77|190.281868569833| 190.28%|
|3/1/2021| 5.984125E8|3796000.0|157.642913593256| 15764.29%|
+--------+-----------+---------+----------------+--------------+
sparkDF.printSchema()
root
|-- Month: string (nullable = true)
|-- Month_start: double (nullable = true)
|-- Month_end: double (nullable = true)
|-- Result: string (nullable = true)
|-- rounded_result: string (nullable = true)
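Since the question states that Result is already a double, the same rounding can be applied directly to that column. A minimal sketch, assuming df is the original dataframe:
import pyspark.sql.functions as F

# Round the existing Result column to 2 decimals and append '%'.
# Note: casting a double to string drops trailing zeros (88.80 -> '88.8');
# use a DecimalType cast or format_number if a fixed width is required.
df = df.withColumn('Result', F.concat(F.round(F.col('Result'), 2).cast('string'), F.lit('%')))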
Upvotes: 4