Reputation: 21
I'm working in PySpark and created a sample dataframe with some long and decimal type columns. I want to fetch a decimal type column's value to two decimal places without rounding. Below is the code I tried.
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([
    (324.456, "hi", "test"),
    (453.987, "hello", "python"),
    (768.66, "test", "java")
], ["col1", "col2", "col3"])

new = df.withColumn(
    "col4",
    F.substring(df.col1.cast(StringType()), 1, F.instr(df.col1.cast(StringType()), ".") + 2))
So here I'm converting the column to a string and finding the index position of the dot, then adding two (because I need two decimal places without rounding). But I don't know what the mistake is here; I'm getting a "Column object is not callable" error. If I use only the F.instr() function, it works fine. Kindly help with another solution to fetch the value to two decimals without rounding.
Expected output
col1     col2   col3    col4
324.456  hi     test    324.45
453.987  hello  python  453.98
768.66   test   java    768.66
Upvotes: 2
Views: 335
Reputation: 75100
You can also use a regular expression with regexp_extract here:
df.withColumn('test',
    F.regexp_extract(F.col("col1").cast("string"), r'\d+[.]\d{2}', 0)).show()
Or, as @MohammadMurtazaHashmi suggested in the comments, no casting is required:
df.withColumn('test', F.regexp_extract(F.col("col1"), r'\d+[.]\d{2}', 0)).show()
+-------+-----+------+------+
| col1| col2| col3| test|
+-------+-----+------+------+
|324.456| hi| test|324.45|
|453.987|hello|python|453.98|
| 768.66| test| java|768.66|
+-------+-----+------+------+
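The regex idea can be sanity-checked without Spark using Python's re module (a minimal sketch; the pattern grabs the digits, the literal dot, and exactly the first two decimal digits, so nothing is rounded):

```python
import re

# Same pattern as in the answer: digits, a literal dot, then exactly two digits.
pattern = r'\d+[.]\d{2}'

values = ["324.456", "453.987", "768.66"]
truncated = [re.search(pattern, v).group(0) for v in values]
print(truncated)  # ['324.45', '453.98', '768.66']
```

Note that a value with fewer than two decimal digits (e.g. "768.6") would not match this pattern, so the extract would come back empty for such rows.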
Upvotes: 2
Reputation: 1962
What you're looking for is a way of truncating decimals. I propose you use pyspark.sql.functions.pow and some clever casting to LongType for this. You multiply by 10^decimal_places, cast to long to drop the remaining fractional digits, and then divide by the same factor again:
df2.show()
+-------+-----+------+
| col1| col2| col3|
+-------+-----+------+
|324.456| hi| test|
|453.987|hello|python|
| 768.66| test| java|
+-------+-----+------+
decimal_places = 2
truncated_value_column = f.pow(f.lit(10), decimal_places).cast('long')
df2.withColumn(
"trunc",
((f.col("col1") * truncated_value_column)).cast("long") / truncated_value_column
).show()
+-------+-----+------+------+
| col1| col2| col3| trunc|
+-------+-----+------+------+
|324.456| hi| test|324.45|
|453.987|hello|python|453.98|
| 768.66| test| java|768.66|
+-------+-----+------+------+
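The scale-cast-divide trick can be mirrored in plain Python to see the mechanics (a sketch; `truncate` is a hypothetical helper name, and it assumes the scaled value fits in an integer):

```python
def truncate(value: float, decimal_places: int = 2) -> float:
    """Drop (not round) everything past `decimal_places` decimal digits."""
    scale = 10 ** decimal_places       # same role as pow(lit(10), decimal_places)
    return int(value * scale) / scale  # int() cast mirrors .cast("long")

for v in (324.456, 453.987, 768.66):
    print(truncate(v))
```

This matches the Spark output above: 324.45, 453.98, 768.66.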
NB: If you then wish to cast back to string, I recommend you do so afterwards. Hope this helps!
Upvotes: 1