Reputation: 33
I have a dataframe and I'm doing this:
df = dataframe.withColumn("test", lit(0.4219759403))
I want to get just the first four numbers after the dot, without rounding.
When I cast to DecimalType, with .cast(DataTypes.createDecimalType(20,4)
or even with round
function, this number is rounded to 0.4220
.
The only way that I found without rounding is applying the function format_number()
, but this function gives me a string, and when I cast this string to DecimalType(20,4)
, the framework rounds the number again to 0.4220
.
I need to convert this number to DecimalType(20,4)
without rounding, and I expect to see 0.4219
.
Upvotes: 1
Views: 3058
Reputation: 453
Hi welcome to stackoverflow,
please next time try to provide a reproducible example with the code you tried, anyways this works for me:
from pyspark.sql.types import DecimalType
df = spark.createDataFrame([
(1, "a"),
(2, "b"),
(3, "c"),
], ["ID", "Text"])
df = df.withColumn("test", lit(0.4219759403))
df = df.withColumn("test_string", F.substring(df["test"].cast("string"), 0, 6))
df = df.withColumn("test_string_decimaltype", df["test_string"].cast(DecimalType(20,4)))
df.show()
df.printSchema()
+---+----+------------+-----------+-----------------------+
| ID|Text| test|test_string|test_string_decimaltype|
+---+----+------------+-----------+-----------------------+
| 1| a|0.4219759403| 0.4219| 0.4219|
| 2| b|0.4219759403| 0.4219| 0.4219|
| 3| c|0.4219759403| 0.4219| 0.4219|
+---+----+------------+-----------+-----------------------+
root
|-- ID: long (nullable = true)
|-- Text: string (nullable = true)
|-- test: double (nullable = false)
|-- test_string: string (nullable = false)
|-- test_string_decimaltype: decimal(20,4) (nullable = true)
Of course if you want you can overwrite the same column by putting always "test", i choose different names to let you see the steps.
Upvotes: -1
Reputation: 9247
If you have numbers with more than 1 digit before the decimal point, the substr
is not adapt. Instead, you can use a regex to always extract the first 4 decimal digits (if present).
You can do this using regexp_extract
df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), '\d+\.\d{0,4}', 0))
Example
import pyspark.sql.functions as F
dataframe = spark.createDataFrame([
(0.4219759403, ),
(0.4, ),
(1.0, ),
(0.5431293, ),
(123.769859, )
], ['test'])
df = dataframe.withColumn('rounded', F.regexp_extract(F.col('test'), '\d+\.\d{0,4}', 0))
df.show()
+------------+--------+
| test| rounded|
+------------+--------+
|0.4219759403| 0.4219|
| 0.4| 0.4|
| 1.0| 1.0|
| 0.5431293| 0.5431|
| 123.769859|123.7698|
+------------+--------+
Upvotes: 3