Hari Jinn
Hari Jinn

Reputation: 29

Convert the string from K to thousands etc., and remove space between strings in pyspark dataframe

I'd like to ask your help for transforming strings in one column in Pyspark dataframe.

For example, I have a dataframe named "df" which has the following structure.

df = spark.createDataFrame([('David','5K'),('William','6M'),('Sam','1B'),('Ashely','342 1'),('Chloe','240.5 4')], ['Name','Numbers'])

where K = thousands, M = millions, and B = billions.

What I want to do here is (1) converting the string from K to thousands, M to millions, B to billions, and (2) removing the spaces between strings in the "Numbers" column, and then (3) changing its datatype to double. You can use regular expressions but I don't want to use pandas in this case.

This would be the desired output after transformation:

Names    |Numbers
---------------------------
David    |5000
William  |6000000
Sam      |1000000000
Ashely   |3421
Chloe    |240.54

I'd appreciate any helps from you guys!

Upvotes: 0

Views: 628

Answers (1)

Cena
Cena

Reputation: 3419

You can use regexp_replace and then multiply the 'Numbers' column with the appropriate 10th power:

from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.functions import *

df.withColumn("Numbers", when(col('Numbers').like("%K"), (regexp_replace('Numbers', 'K', '').cast('double')*1000))\
    .when(col('Numbers').like("%M"), (regexp_replace('Numbers', 'M', '').cast('double')*1000000))\
    .when(col('Numbers').like("%B"), (regexp_replace('Numbers', 'B', '').cast('double')*1000000000))\
    .otherwise((regexp_replace('Numbers', ' ', '').cast('double'))))\
    .show()

Output:

+-------+---------+
|   Name|  Numbers|
+-------+---------+
|  David|   5000.0|
|William|6000000.0|
|    Sam|    1.0E9|
| Ashely|   3421.0|
|  Chloe|   240.54|
+-------+---------+

If we have 5.5K, then the output will be 5500.0

Upvotes: 1

Related Questions