harumomo503

Reputation: 411

sanitize column values in pyspark dataframe

Given a CSV file, I converted it to a DataFrame using code like the following.

raw_df = spark.read.csv(input_data, header=True)

That creates a DataFrame that looks something like this:

| Name |
========
|  23  |
|  hi2 |
|  me3 |
|  do  |

I want to convert this column so it only contains numbers. The final result should look like this, with hi and me removed:

| Name |
========
|  23  |
|   2  |
|   3  |
|  do  |

I want to sanitize the values and make sure the column only contains numbers, but I'm not sure if that's possible in Spark.

Upvotes: 2

Views: 1068

Answers (2)

vikrant rana

Reputation: 4674

Here is another way of doing the same thing. It's just an alternative, though it's better to use Spark's built-in functions when available, as shown in the other answer.

from pyspark.sql.functions import udf
import re

# Extract the first run of digits; return None when the value has no digits
user_func = udf(lambda x: (re.findall(r"\d+", x) or [None])[0])
newdf = df.withColumn('new_column', user_func(df.Name))

>>> newdf.show()
+----+----------+
|Name|new_column|
+----+----------+
|  23|        23|
| hi2|         2|
| me3|         3|
+----+----------+
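
As a sketch of the built-in alternative alluded to above (assuming the same df with a Name column), regexp_extract can pull the first run of digits without a Python UDF:

from pyspark.sql.functions import regexp_extract

# Built-in equivalent: extract the first run of digits (empty string if none)
newdf = df.withColumn('new_column', regexp_extract('Name', r'\d+', 0))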

Upvotes: 1

SMaZ

Reputation: 2655

Yes, it's possible. You can use regexp_replace from pyspark.sql.functions.

Please check this:

import pyspark.sql.functions as f

df = spark.sparkContext.parallelize([('12',), ('hi2',), ('me3',)]).toDF(["name"])

df.show()
+----+
|name|
+----+
|  12|
| hi2|
| me3|
+----+

final_df = df.withColumn('sanitize', f.regexp_replace('name', '[a-zA-Z]', ''))

final_df.show()
+----+--------+
|name|sanitize|
+----+--------+
|  12|      12|
| hi2|       2|
| me3|       3|
+----+--------+

final_df.withColumn('len', f.length('sanitize')).show()
+----+--------+---+
|name|sanitize|len|
+----+--------+---+
|  12|      12|  2|
| hi2|       2|  1|
| me3|       3|  1|
+----+--------+---+

You can adjust the regex as needed.
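
For example, a sketch of a stricter pattern that strips every non-digit character rather than only letters (same df as above):

# Remove everything that is not a digit
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[^0-9]', ''))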

Upvotes: 1
