SQL RV

Reputation: 97

SparkSQL Regular Expression: Cannot remove backslash from text

I have data embedded in my text fields that I need to suppress. The data is in the format \nnn, where nnn is a 3-digit number. I tried the following:

spark.sql("select regexp_replace('ABC\123XYZ\456','[\d][\d][\d]','') as new_value").show()

I expected the result to be "ABC\XYZ\", but what I got instead was:

+---------+
|new_value|
+---------+
| ABCSXYZĮ|
+---------+

I'm not sure what the other characters are after the C and after Z.

However, I need to remove the backslash as well. To get rid of it, I then tried this:

spark.sql("select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value").show()

However, it simply crashed on me. No matter how I tried to escape the backslash, it failed. I tested the regular expression on regex101.com and it was fine.

I appreciate any suggestions!

EDIT: I tried Daniel's suggestion but received the following errors:

>>> spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 649, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.(line 1, pos 22)

== SQL ==
select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value
----------------------^^^

Thoughts?

Upvotes: 0

Views: 885

Answers (2)

SQL RV

Reputation: 97

Daniel,

Thank you SO MUCH for your help! With the examples you gave, I was able to tweak my SQL statement so it would work. Besides adding the r prefix to the select statement string, I had to change the regular expression to this:

'[\\\\][\\\d][\\\d][\\\d]'
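Putting it together, the full test statement ends up essentially the same as the one in Daniel's update below:

spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()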

Thank you much!

Aaron

Upvotes: 0

danielcahall

Reputation: 2742

If you check the string outside of the spark.sql call, you will notice the string has special characters due to how Python handles the backslash in a literal string:

>>> "select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABCSXYZĮ','[\\][\\d][\\d][\\d]','') as new_value"

If we add the r prefix to the string, it will be considered a "raw" string (see here), which treats backslashes as literal characters:

>>> r"select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\d][\\d][\\d]','') as new_value"

That looks correct, but if we just supply the raw Python string to spark.sql, it raises SparkException: The Spark SQL phase optimization failed with an internal error. In addition to making the Python string raw, we also need to make the string literals inside the Spark SQL query raw, which can again be done with the r prefix (see here), so that the Spark SQL parser handles the backslashes correctly.

It's a bit odd looking, but the end result is:

spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()

And the output is, as expected:

+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+

It's also worth mentioning that if you call regexp_replace as a function directly (from pyspark.sql.functions import regexp_replace), rather than within a Spark SQL query, the pattern seems to be implicitly treated as a raw string. It does not require an r prefix; the backslash in the first character class just needs to be escaped:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace

    spark = SparkSession\
        .builder\
        .appName("test")\
        .getOrCreate()

    # The raw string puts literal backslashes into the data: ABC\123XYZ\456
    df = spark.createDataFrame([(r'ABC\123XYZ\456',)], ['temp'])

    # Only the backslash in the first character class needs an extra escape
    df.select(regexp_replace('temp', '[\\\][\d][\d][\d]', '').alias('new_value')).show()

Output:

+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+
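As a side note, and purely a stylistic alternative: the same select on the df above can be written with a raw Python string for the pattern, which makes the backslash counting easier to follow:

df.select(regexp_replace('temp', r'\\\d{3}', '').alias('new_value')).show()

Here r'\\\d{3}' reaches the regex engine as \\\d{3}, i.e. a literal backslash followed by exactly three digits, and produces the same ABCXYZ output.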

Update

It looks like when running on EMR, raw string literals aren't supported in Spark SQL, which manifests as a parsing error. In that case, making the Python string a raw string and adding escape characters within the SQL string should also work:

    from pyspark.sql import SparkSession

    spark = SparkSession\
        .builder\
        .appName("test")\
        .getOrCreate()

    # Python raw string; the backslashes in the text literal and the pattern
    # are escaped once more for the Spark SQL parser
    spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()

Output:

+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+

Note that the string the regex is applied to only needs one level of escaping in the query (\\ for each literal \), whereas the pattern needs two (\\\\ for each literal \), because the pattern is unescaped once by the Spark SQL parser and then interpreted again by the regex engine.
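If it helps to see why, here is a rough illustration using Python's re module in place of the Java regex engine Spark uses; the text and pattern below are what the regex engine sees after the Spark SQL parser has removed one level of escaping:

    import re

    # Illustration only: the text and pattern as they look *after* the
    # Spark SQL parser has stripped one level of escaping
    text = r'ABC\123XYZ\456'
    pattern = r'[\\][\d][\d][\d]'      # a backslash followed by three digits
    print(re.sub(pattern, '', text))   # ABCXYZ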

Upvotes: 1
