Reputation: 2653
I want to remove parts of a string in PySpark, such as 'www.' and '.com', using regexp_replace. Is it possible to pass a list of elements to be replaced?
my_list = ['www.google.com', 'google.com','www.goole']
from pyspark.sql import Row
from pyspark.sql.functions import regexp_replace
df = sc.parallelize(my_list).map(lambda x: Row(url = x)).toDF()
df.withColumn('site', regexp_replace('url', 'www.', '')).show()
I want to replace both www. and .com in the example above.
Upvotes: 4
Views: 5582
Reputation: 214957
Use a pipe | (OR) to combine the two patterns into a single regex pattern www\.|\.com, which will match www. or .com. Note that you need to escape . to match it literally, since . matches (almost) any character in regex:
df.withColumn('site', regexp_replace('url', r'www\.|\.com', '')).show()
+--------------+------+
|           url|  site|
+--------------+------+
|www.google.com|google|
|    google.com|google|
|     www.goole| goole|
+--------------+------+
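If the strings to remove really do arrive as a list, the same pattern can be built from it instead of typed by hand. This is only a minimal sketch: build_pattern and to_remove are hypothetical names, not part of PySpark. It checks the pattern with Python's own re module, on the assumption that Spark's Java regex engine treats these simple escaped literals the same way.

```python
import re

def build_pattern(parts):
    # Hypothetical helper: escape each literal so '.' matches a real dot,
    # then join the pieces with '|' (OR). Longer strings are placed first
    # so a shorter alternative cannot shadow a longer one.
    ordered = sorted(parts, key=len, reverse=True)
    return '|'.join(re.escape(p) for p in ordered)

to_remove = ['www.', '.com']
pattern = build_pattern(to_remove)

# Sanity check with plain Python before handing the pattern to Spark.
urls = ['www.google.com', 'google.com', 'www.goole']
cleaned = [re.sub(pattern, '', u) for u in urls]
print(pattern)
print(cleaned)

# With the DataFrame from the question, the pattern string would be used as:
# df.withColumn('site', regexp_replace('url', pattern, '')).show()
```

Escaping with re.escape avoids having to remember which characters are special. For patterns containing more exotic characters, it is worth checking that the escaped form means the same thing in Java regex.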
Upvotes: 5