quervernetzt

Reputation: 11621

pyspark: Remove substring that is the value of another column and includes regex characters from the value of a given column

Let's say I have a DataFrame like this:

df = spark.createDataFrame(
  [
    ('Test1 This is a test Test2','This is a test'),
    ('That is','That')
  ],
  ['text','name'])


+--------------------------+--------------+
|text                      |name          |
+--------------------------+--------------+
|Test1 This is a test Test2|This is a test|
|That is                   |That          |
+--------------------------+--------------+

If I apply df.withColumn("new",F.expr("regexp_replace(text,name,'')")).show(truncate=False) it works fine and results in

+--------------------------+--------------+------------+
|text                      |name          |new         |
+--------------------------+--------------+------------+
|Test1 This is a test Test2|This is a test|Test1  Test2|
|That is                   |That          | is         |
+--------------------------+--------------+------------+

Now let's say I have the following DataFrame, where name contains regex metacharacters:

+-----------------------------+-----------------+
|text                         |name             |
+-----------------------------+-----------------+
|Test1 This is a test(+1 Test2|This is a test(+1|
|That is                      |That             |
+-----------------------------+-----------------+

If I apply the command from above, I get the following error message:

java.util.regex.PatternSyntaxException: Dangling meta character '+'

How can I avoid this exception in the most idiomatic PySpark way, while keeping the value in text as is?

Thanks

Upvotes: 3

Views: 1347

Answers (1)

notNull

Reputation: 31460

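Use Spark's replace function instead of regexp_replace. regexp_replace interprets the value of name as a regular expression, so characters such as ( and + are treated as metacharacters; replace matches the search string literally.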

replace(str, search[, replace]) - Replaces all occurrences of search with replace.

Example:

df.show(10,False)
#+-----------------------------+-----------------+
#|text                         |name             |
#+-----------------------------+-----------------+
#|Test1 This is a test(+1 Test2|This is a test(+1|
#|That is                      |That             |
#+-----------------------------+-----------------+

from pyspark.sql.functions import expr

df.withColumn("new",expr("replace(text,name,'')")).show(10,False)
#+-----------------------------+-----------------+------------+
#|text                         |name             |new         |
#+-----------------------------+-----------------+------------+
#|Test1 This is a test(+1 Test2|This is a test(+1|Test1  Test2|
#|That is                      |That             | is         |
#+-----------------------------+-----------------+------------+
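
To see why the literal replace avoids the exception, here is a rough sketch in plain Python rather than Spark. It assumes Python's re module as a stand-in for the Java regex engine that Spark uses, so the exact error text differs, but the cause is the same: the value of name is not a valid pattern when read as a regex.

```python
import re

text = "Test1 This is a test(+1 Test2"
name = "This is a test(+1"

# Treated as a regex, "(+" is invalid: "+" has nothing to repeat.
try:
    re.sub(name, "", text)
    regex_failed = False
except re.error:
    regex_failed = True

# A literal replacement never interprets metacharacters.
literal_result = text.replace(name, "")

# Escaping the pattern first makes the regex engine match it literally too.
escaped_result = re.sub(re.escape(name), "", text)

print(regex_failed)
print(repr(literal_result))
print(repr(escaped_result))
```

The literal replacement corresponds to what replace(text,name,'') does in the Spark example above; the escaped variant is only shown to illustrate that the metacharacters are the source of the problem.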

Upvotes: 4
