hunter

Reputation: 21

Substring in PySpark results in "Column object is not callable"

I'm working in PySpark and created a sample DataFrame with some long and decimal type columns. I want to get the decimal column's value truncated to two decimal places, without rounding. Below is the code I tried.

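from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([
  (324.456, "hi", "test"),
  (453.987, "hello", "python"),
  (768.66, "test", "java")
  ], ["col1", "col2", "col3"]
)
# This is the call that raises "Column object is not callable"
new = df.withColumn(
 "col4",
 F.substring((df.col1).cast(StringType()),1,F.instr((df.col1).cast(StringType()),".")+2))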

Here I'm converting the column to a string, finding the index position of the decimal point, and adding two (because I need two decimal places without rounding). I don't know what the mistake is, but I'm getting a "Column object is not callable" error. If I use only the F.instr() function, it works fine. Kindly help with a fix or another way to fetch the value to two decimals without rounding.

Expected output
col1     col2   col3   col4
324.456  hi     test   324.45
453.987  hello  python 453.98
768.66   test   java   768.66

Upvotes: 2

Views: 335

Answers (2)

anky

Reputation: 75100

You can also use a regular expression with regexp_extract here:

df.withColumn('test',
              F.regexp_extract(F.col("col1").cast("string"),r'\d+[.]\d{2}',0)).show()

Or, as @MohammadMurtazaHashmi suggested in the comments, no casting is required:

df.withColumn('test',F.regexp_extract(F.col("col1"),r'\d+[.]\d{2}',0)).show()

+-------+-----+------+------+
|   col1| col2|  col3|  test|
+-------+-----+------+------+
|324.456|   hi|  test|324.45|
|453.987|hello|python|453.98|
| 768.66| test|  java|768.66|
+-------+-----+------+------+
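
To see what the pattern picks out without starting a Spark session, here is a minimal plain-Python sketch using the standard re module. It only approximates regexp_extract (Spark uses Java regular expressions, though this simple pattern should behave the same way), and the helper name truncate_str is made up for this illustration.

```python
import re

# Same pattern as in the regexp_extract call: one or more digits,
# a literal dot, then exactly two digits.
PATTERN = re.compile(r"\d+[.]\d{2}")

def truncate_str(value: str) -> str:
    """Hypothetical helper: return the first match, or '' if there is none.

    Returning '' on no match mirrors what regexp_extract does in Spark.
    """
    match = PATTERN.search(value)
    return match.group(0) if match else ""

for s in ["324.456", "453.987", "768.66"]:
    print(s, "->", truncate_str(s))
```

One thing to watch for: a value with fewer than two digits after the dot (for example "5.1") does not match the pattern at all, so the result is an empty string rather than the original value.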

Upvotes: 2

Napoleon Borntoparty

Reputation: 1962

What you're looking for is a way to truncate decimals. I suggest using pyspark.sql.functions.pow together with a cast to LongType: multiply by 10^{decimal_places}, cast to long to drop the remaining fractional digits, then divide by the same factor again, such as:

df2.show()
+-------+-----+------+
|   col1| col2|  col3|
+-------+-----+------+
|324.456|   hi|  test|
|453.987|hello|python|
| 768.66| test|  java|
+-------+-----+------+


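import pyspark.sql.functions as f

decimal_places = 2
# Scaling factor 10^decimal_places (100 here), cast to long
truncated_value_column = f.pow(f.lit(10), decimal_places).cast('long')

# Scale up, cast to long to drop the extra digits, then scale back down
df2.withColumn(
    "trunc",
    (f.col("col1") * truncated_value_column).cast("long") / truncated_value_column
).show()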
+-------+-----+------+------+
|   col1| col2|  col3| trunc|
+-------+-----+------+------+
|324.456|   hi|  test|324.45|
|453.987|hello|python|453.98|
| 768.66| test|  java|768.66|
+-------+-----+------+------+
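
The same multiply, truncate, divide arithmetic can be sketched in plain Python to check the idea on single values. This is only an illustration of the arithmetic, not Spark code: the function name truncate is made up here, and int() stands in for the cast to long (both drop the fractional part toward zero).

```python
def truncate(value: float, decimal_places: int = 2) -> float:
    """Hypothetical helper: drop digits beyond decimal_places without rounding."""
    factor = 10 ** decimal_places   # e.g. 100 for two decimal places
    scaled = int(value * factor)    # int() discards the fractional part
    return scaled / factor          # scale back down

for x in [324.456, 453.987, 768.66]:
    print(x, "->", truncate(x))
```

Because this works on binary floating-point numbers, it is worth testing on your own data before relying on it for values that sit exactly on a two-decimal boundary.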

NB: If you then wish to cast back to string, I recommend you do so afterwards. Hope this helps!

Upvotes: 1
