Fisseha Berhane

Reputation: 2653

Replace more than one element in PySpark

I want to replace parts of a string in PySpark using regexp_replace, such as 'www.' and '.com'. Is it possible to pass a list of elements to be replaced?

my_list = ['www.google.com', 'google.com','www.goole']
from pyspark.sql import Row
from pyspark.sql.functions import regexp_replace
df = sc.parallelize(my_list).map(lambda x: Row(url = x)).toDF()
df.withColumn('site', regexp_replace('url', 'www.', '')).show()

I want to replace both www. and .com in the above example.

Upvotes: 4

Views: 5582

Answers (1)

akuiper

Reputation: 214957

Use a pipe | (OR) to combine the two patterns into a single regex pattern www\.|\.com, which will match either www. or .com. Note that you need to escape . to match it literally, since an unescaped . matches (almost) any character in regex:

df.withColumn('site', regexp_replace('url', r'www\.|\.com', '')).show()
+--------------+------+
|           url|  site|
+--------------+------+
|www.google.com|google|
|    google.com|google|
|     www.goole| goole|
+--------------+------+
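
To start from a list, as the question asks, one option is to build the alternation from the list yourself. The sketch below is a minimal illustration, not part of the original answer: the names to_remove, pattern, urls and cleaned are made up for the example. It uses Python's re.escape, which is written for Python's own regex engine; the assumption here is that its output is also valid for the Java regex engine Spark uses, which should hold for simple punctuation such as a dot. The pattern is checked locally with re.sub so the snippet runs without a Spark session.

```python
import re

# Hypothetical list of literal substrings to strip (names are illustrative).
to_remove = ['www.', '.com']

# Escape each literal so '.' is matched literally, then join with '|' (OR).
pattern = '|'.join(re.escape(s) for s in to_remove)

# Check the pattern locally with Python's re module before using it in Spark.
urls = ['www.google.com', 'google.com', 'www.goole']
cleaned = [re.sub(pattern, '', u) for u in urls]

# With the DataFrame df from the question, the same pattern string could be
# passed to regexp_replace (assumes pyspark.sql.functions.regexp_replace is imported):
# df.withColumn('site', regexp_replace('url', pattern, '')).show()
```

Because every match is removed wherever it occurs, a value such as 'www.comwww.net' would lose both pieces in the middle as well; anchoring the alternatives would be needed if only a leading www. or a trailing .com should go.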

Upvotes: 5
