harumomo503

Reputation: 411

sanitize column values in pyspark dataframe

Given a CSV file, I converted it to a DataFrame using code like the following.

raw_df = spark.read.csv(input_data, header=True)

That creates a DataFrame that looks something like this:

| Name |
========
|  23  |
|  hi2 |
|  me3 |
|  do  |

I want to convert this column so it only contains numbers. The final result should look like this, with hi and me removed:

| Name |
========
|  23  |
|   2  |
|   3  |
|  do  |

I want to sanitize the values and make sure the column only contains numbers, but I'm not sure if that's possible in Spark.

Upvotes: 2

Views: 1068

Answers (2)

vikrant rana

Reputation: 4674

Here is another way of doing the same thing. It's just an alternative, though it's better to use Spark's built-in functions when available, as shown in the other answer.

from pyspark.sql.functions import udf
import re

# Extract the first run of digits; return None when the value has no digits
user_func = udf(lambda x: (re.findall(r"\d+", x) or [None])[0])
newdf = df.withColumn('new_column', user_func(df.Name))

>>> newdf.show()
+----+----------+
|Name|new_column|
+----+----------+
|  23|        23|
| hi2|         2|
| me3|         3|
+----+----------+
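
As a sketch of the built-in alternative alluded to above (assuming the same df with a Name column), regexp_extract can pull the first run of digits without a Python UDF:

from pyspark.sql.functions import regexp_extract

# Built-in equivalent: extract the first run of digits (empty string if none)
newdf = df.withColumn('new_column', regexp_extract('Name', r'\d+', 0))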

Upvotes: 1

SMaZ

Reputation: 2655

Yes, it's possible. You can use regexp_replace from pyspark.sql.functions.

Please check this:

import pyspark.sql.functions as f

df = spark.sparkContext.parallelize([('12',), ('hi2',), ('me3',)]).toDF(["name"])

df.show()
+----+
|name|
+----+
|  12|
| hi2|
| me3|
+----+

final_df = df.withColumn('sanitize', f.regexp_replace('name', '[a-zA-Z]', ''))

final_df.show()
+----+--------+
|name|sanitize|
+----+--------+
|  12|      12|
| hi2|       2|
| me3|       3|
+----+--------+

final_df.withColumn('len', f.length('sanitize')).show()
+----+--------+---+
|name|sanitize|len|
+----+--------+---+
|  12|      12|  2|
| hi2|       2|  1|
| me3|       3|  1|
+----+--------+---+

You can adjust the regex as needed.
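
For example, a sketch of a stricter pattern that strips every non-digit character rather than only letters (same df as above):

# Remove everything that is not a digit
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[^0-9]', ''))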

Upvotes: 1
