Reputation: 29
I'd like to ask for your help with transforming strings in one column of a PySpark DataFrame.
For example, I have a DataFrame named "df" with the following structure:
df = spark.createDataFrame([('David','5K'),('William','6M'),('Sam','1B'),('Ashely','342 1'),('Chloe','240.5 4')], ['Name','Numbers'])
where K = thousands, M = millions, and B = billions.
What I want to do here is (1) convert the suffixes from K to thousands, M to millions, and B to billions, (2) remove the spaces between the digits in the "Numbers" column, and then (3) change its datatype to double. Regular expressions are fine, but I don't want to use pandas in this case.
This would be the desired output after transformation:
Name    |Numbers
--------|----------
David   |5000
William |6000000
Sam     |1000000000
Ashely  |3421
Chloe   |240.54
I'd appreciate any help from you guys!
Upvotes: 0
Views: 628
Reputation: 3419
You can use regexp_replace to strip the suffix and then multiply the 'Numbers' column by the appropriate power of 10:
from pyspark.sql.functions import col, regexp_replace, when
df.withColumn("Numbers", when(col('Numbers').like("%K"), (regexp_replace('Numbers', 'K', '').cast('double')*1000))\
.when(col('Numbers').like("%M"), (regexp_replace('Numbers', 'M', '').cast('double')*1000000))\
.when(col('Numbers').like("%B"), (regexp_replace('Numbers', 'B', '').cast('double')*1000000000))\
.otherwise((regexp_replace('Numbers', ' ', '').cast('double'))))\
.show()
Output:
+-------+---------+
| Name| Numbers|
+-------+---------+
| David| 5000.0|
|William|6000000.0|
| Sam| 1.0E9|
| Ashely| 3421.0|
| Chloe| 240.54|
+-------+---------+
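Note that .show() only displays the result. To keep the transformed column (requirement (3) in the question), assign the result back to a variable; since every branch casts to double, the column ends up as DoubleType. A minimal sketch (df2 is just an illustrative name):
# Keep the transformed DataFrame; every branch casts to double,
# so the Numbers column ends up as DoubleType
df2 = df.withColumn(
    "Numbers",
    when(col('Numbers').like("%K"), regexp_replace('Numbers', 'K', '').cast('double') * 1000)
    .when(col('Numbers').like("%M"), regexp_replace('Numbers', 'M', '').cast('double') * 1000000)
    .when(col('Numbers').like("%B"), regexp_replace('Numbers', 'B', '').cast('double') * 1000000000)
    .otherwise(regexp_replace('Numbers', ' ', '').cast('double'))
)
df2.printSchema()
# root
#  |-- Name: string (nullable = true)
#  |-- Numbers: double (nullable = true)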
If we have 5.5K, then the output will be 5500.0.
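As a quick check (a minimal sketch; the one-row test DataFrame below is only for illustration), you can run the same K branch on a '5.5K' value:
# Hypothetical one-row DataFrame to verify the 5.5K case: 5.5 * 1000 = 5500.0
test_df = spark.createDataFrame([('Test', '5.5K')], ['Name', 'Numbers'])
test_df.withColumn(
    "Numbers",
    when(col('Numbers').like("%K"), regexp_replace('Numbers', 'K', '').cast('double') * 1000)
    .otherwise(regexp_replace('Numbers', ' ', '').cast('double'))
).show()
# Shows 5500.0 in the Numbers column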
Upvotes: 1