Pat Stroh

Reputation: 199

Regular expressions in Databricks Spark SQL

Curious thing. I'm using Databricks, but getting two different behaviors when using the regexp_replace function.

%sql
select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result

returns what I expect: TEST (PA)

Yet:

%python    
display(spark.sql("""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""))

returns: TEST (PA) (1234)

Shouldn't these be equivalent in the two contexts (Spark SQL vs. Spark/Python/PySpark)? Why doesn't the second one work?

Upvotes: 1

Views: 4668

Answers (1)

Alex Ott

Reputation: 87359

You aren't taking character-escaping rules into account. In Python, \\ in a non-raw string is converted into a single \, so Spark receives the SQL literal '[(]\d+[)]'; Spark SQL's parser then drops that lone backslash, so the regex that actually runs is [(]d+[)], which no longer matches the digits. You have two options:

  • use a raw string: r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""

  • double the backslashes: [(]\\\\d+[)]

With either of these changes, it produces the correct result; a quick sketch of both is below.
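For reference, a minimal sketch showing both fixes side by side, assuming this runs in a Databricks notebook where a SparkSession is already available as spark:

%python
# Why the original fails: Python collapses \\ to \ before Spark sees the
# query, and Spark SQL's parser then drops the lone backslash, so the
# regex that actually runs is [(]d+[)], which matches nothing.

# Option 1: raw string -- Python passes \\d through untouched,
# and Spark SQL unescapes it to the regex \d
query_raw = r"""select upper(regexp_replace('Test (PA) (1234) ', '[(]\\d+[)]', '')) as result"""

# Option 2: double every backslash -- Python turns \\\\ into \\,
# and Spark SQL unescapes that to \d
query_doubled = """select upper(regexp_replace('Test (PA) (1234) ', '[(]\\\\d+[)]', '')) as result"""

for q in (query_raw, query_doubled):
    # both print TEST (PA) (trailing spaces from the input string remain)
    spark.sql(q).show(truncate=False)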

Upvotes: 2
