SQL RV

Reputation: 97

SparkSQL Regular Expression: Cannot remove backslash from text

I have data embedded in my text fields that I need to suppress. The data is in the format \nnn, where nnn is a 3-digit number. I tried the following:

spark.sql("select regexp_replace('ABC\123XYZ\456','[\d][\d][\d]','') as new_value").show()

I expected the result to be "ABC\XYZ\", but what I got instead was:

+---------+
|new_value|
+---------+
| ABCSXYZĮ|
+---------+

I'm not sure what the other characters are after the C and after Z.

However, I need to remove the backslash as well. To get rid of it, I then tried this:

spark.sql("select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value").show()

However, it simply crashed on me. No matter how I tried to escape the backslash, it failed. I tested the regular expression on regex101.com and it was fine.

I appreciate any suggestions!

EDIT: I tried Daniel's suggestion but received the following errors:

>>> spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 649, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.(line 1, pos 22)

== SQL ==
select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value
----------------------^^^

Thoughts?

Upvotes: 0

Views: 885

Answers (2)

SQL RV

Reputation: 97

Daniel,

Thank you SO MUCH for your help! With the examples you gave, I was able to tweak my SQL statement so it would work. Besides adding the r prefix to the select statement string, I had to change the regular expression to this:

'[\\\\][\\\d][\\\d][\\\d]'
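Putting it together, the full test statement ends up essentially the same as the one in Daniel's update below:

spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()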

Thank you much!

Aaron

Upvotes: 0

danielcahall

Reputation: 2742

If you check the string outside of the spark.sql call, you will notice the string has special characters due to how Python handles the backslash in a literal string:

>>> "select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABCSXYZĮ','[\\][\\d][\\d][\\d]','') as new_value"

If we add the r prefix to the string, it will be considered a "raw" string (see here), which treats backslashes as literal characters:

>>> r"select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\d][\\d][\\d]','') as new_value"

That looks correct, but if we just supply the raw Python string to spark.sql, it raises SparkException: The Spark SQL phase optimization failed with an internal error. In addition to making the Python string raw, we also need to make the string literals inside the Spark SQL query raw, which can again be done with the r prefix (see here), so that the Spark SQL parser handles the backslashes correctly.

It's a bit odd looking, but the end result is:

spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()

And the output is, as expected:

+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+

It's also worth mentioning that if you call regexp_replace as a function directly (from pyspark.sql.functions import regexp_replace), rather than within a Spark SQL query, the pattern seems to be implicitly treated as a raw string. It does not require an r prefix; the backslash in the first character class just needs to be escaped:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import regexp_replace

    spark = SparkSession\
        .builder\
        .appName("test")\
        .getOrCreate()

    # The raw string puts literal backslashes into the data: ABC\123XYZ\456
    df = spark.createDataFrame([(r'ABC\123XYZ\456',)], ['temp'])

    # Only the backslash in the first character class needs an extra escape
    df.select(regexp_replace('temp', '[\\\][\d][\d][\d]', '').alias('new_value')).show()

Output:

+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+
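As a side note, and purely a stylistic alternative: the same select on the df above can be written with a raw Python string for the pattern, which makes the backslash counting easier to follow:

df.select(regexp_replace('temp', r'\\\d{3}', '').alias('new_value')).show()

Here r'\\\d{3}' reaches the regex engine as \\\d{3}, i.e. a literal backslash followed by exactly three digits, and produces the same ABCXYZ output.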

Update

It looks like when running on EMR, raw string literals aren't supported in Spark SQL, which manifests as a parsing error. In that case, making the Python string a raw string and adding escape characters within the SQL string should also work:

    from pyspark.sql import SparkSession

    spark = SparkSession\
        .builder\
        .appName("test")\
        .getOrCreate()

    # Python raw string; the backslashes in the text literal and the pattern
    # are escaped once more for the Spark SQL parser
    spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()

Output:

+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+

Note that the string the regex is applied to only needs one level of escaping in the query (\\ for each literal \), whereas the pattern needs two (\\\\ for each literal \), because the pattern is unescaped once by the Spark SQL parser and then interpreted again by the regex engine.
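If it helps to see why, here is a rough illustration using Python's re module in place of the Java regex engine Spark uses; the text and pattern below are what the regex engine sees after the Spark SQL parser has removed one level of escaping:

    import re

    # Illustration only: the text and pattern as they look *after* the
    # Spark SQL parser has stripped one level of escaping
    text = r'ABC\123XYZ\456'
    pattern = r'[\\][\d][\d][\d]'      # a backslash followed by three digits
    print(re.sub(pattern, '', text))   # ABCXYZ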

Upvotes: 1
