Chaitanya Kirty

Reputation: 67

How to use regexp_replace to remove special characters from a column in a PySpark dataframe

There is a column batch in a dataframe. It has values like '9%', '$5', etc.

I need to use regexp_replace to remove the special characters from values like these and keep just the numeric part.

For example, 9 and 5 should replace 9% and $5 respectively in the same column.

Upvotes: 3

Views: 40720

Answers (3)

karthik selvaraj

Reputation: 436

You can use this regex:

\W+

\W matches any non-word character (equivalent to [^a-zA-Z0-9_])
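The pattern can be sanity-checked with Python's built-in re module (Spark's regexp_replace uses Java regex, but \W behaves the same way for these inputs):

```python
import re

# \W+ matches one or more non-word characters ([^a-zA-Z0-9_]),
# which covers the '%' and '$' in the sample values.
print(re.sub(r"\W+", "", "9%"))   # 9
print(re.sub(r"\W+", "", "$5"))   # 5
```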

Upvotes: 2

Bala

Reputation: 11244

What have you tried so far?

select regexp_replace("'$5','9%'", "[^0-9A-Za-z]", "")
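Applied to that literal, the character class strips the quotes, the comma, and the symbols. The same class can be tried with Python's re module (Java and Python regex agree on a simple class like this):

```python
import re

# [^0-9A-Za-z] matches every character that is not a digit or a letter,
# so the quotes, '$', ',' and '%' are all removed from the literal.
print(re.sub(r"[^0-9A-Za-z]", "", "'$5','9%'"))   # 59
```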

Upvotes: 1

undefined_variable

Reputation: 6218

from pyspark.sql.functions import regexp_replace, col

df.withColumn("batch", regexp_replace(col("batch"), "[^0-9]+", ""))
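One pitfall worth noting: Java regex (which regexp_replace uses) treats / as an ordinary literal character, not a pattern delimiter, so a JavaScript-style pattern like "/[^0-9]+/" only matches non-digits that are actually wrapped in slashes. The difference is easy to check with Python's re module, which handles / the same way:

```python
import re

# The slashes are literal characters in the pattern, so "/[^0-9]+/"
# never matches a plain value like "$5" -- nothing is removed.
print(re.sub(r"/[^0-9]+/", "", "$5"))   # $5

# Without the slashes, all non-digit characters are stripped.
print(re.sub(r"[^0-9]+", "", "$5"))     # 5
```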

Upvotes: 7
