Pat Stroh

Reputation: 199

Regular expressions in Databricks Spark SQL

Curious thing. I'm using Databricks, but getting two different behaviors when using the regexp_replace function.

%sql
select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result

returns what I expect: TEST (PA)

Yet:

%python    
display(spark.sql("""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""))

returns: TEST (PA) (1234)

Shouldn't these be equivalent in the two contexts (Spark SQL vs. Spark/Python/PySpark)? Why doesn't the second one work?

Upvotes: 1

Views: 4668

Answers (1)

Alex Ott

Reputation: 87359

You aren't taking character-escaping rules into account. In Python, \\ in a non-raw string is converted into a single \, so Spark receives the SQL literal '[(]\d+[)]'; Spark SQL's parser then drops that lone backslash, so the regex that actually runs is [(]d+[)], which no longer matches the digits. You have two options:

  • use a raw string: r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""

  • double the backslashes: [(]\\\\d+[)]

With either of these changes, it produces the correct result; a quick sketch of both is below.
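For reference, a minimal sketch showing both fixes side by side, assuming this runs in a Databricks notebook where a SparkSession is already available as spark:

%python
# Why the original fails: Python collapses \\ to \ before Spark sees the
# query, and Spark SQL's parser then drops the lone backslash, so the
# regex that actually runs is [(]d+[)], which matches nothing.

# Option 1: raw string -- Python passes \\d through untouched,
# and Spark SQL unescapes it to the regex \d
query_raw = r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""

# Option 2: double every backslash -- Python turns \\\\ into \\,
# and Spark SQL unescapes that to \d
query_doubled = """select upper(regexp_replace('Test (PA) (1234) ', '[(]\\\\d+[)]', '')) as result"""

for q in (query_raw, query_doubled):
    # both print TEST (PA) (trailing spaces from the input string remain)
    spark.sql(q).show(truncate=False)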

Upvotes: 2
