Reputation: 97
I have data embedded in my text fields that I need to suppress. The data is in the format \nnn, where nnn is a 3-digit number. I tried the following:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\d][\d][\d]','') as new_value").show()
I expected the result to be "ABC\XYZ\", but what I got instead was:
+---------+
|new_value|
+---------+
| ABCSXYZĮ|
+---------+
I'm not sure what the other characters after the C and after the Z are. However, I need to remove the backslash as well, so I then tried this:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value").show()
However, it simply crashed on me. No matter how I tried to escape the backslash, it failed. I tested the regular expression on regex101.com and it was fine.
I appreciate any suggestions!
EDIT: I tried Daniel's suggestion but received the following errors:
>>> spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 649, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.(line 1, pos 22)
== SQL ==
select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value
----------------------^^^
Thoughts?
Upvotes: 0
Views: 885
Reputation: 97
Daniel,
Thank you SO MUCH for your help! With the examples you gave, I was able to tweak my SQL statement so it would work. Besides adding the r prefix before the select statement string, I had to change the regular expression to this:
'[\\\\][\\\d][\\\d][\\\d]'
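Since my real data is in table columns, the test literal also needed its backslashes doubled; the full test statement ended up looking something like the EMR-compatible form at the end of Daniel's answer below:
spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()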
Thank you much!
Aaron
Upvotes: 0
Reputation: 2742
If you check the string outside of the spark.sql call, you will notice the string has special characters due to how Python handles the backslash in a literal string:
>>> "select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABCSXYZĮ','[\\][\\d][\\d][\\d]','') as new_value"
If we add the r prefix to the string, it will be considered a "raw" string (see here), which treats backslashes as literal characters:
>>> r"select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\d][\\d][\\d]','') as new_value"
That looks correct, but if we just supply the raw Python string to spark.sql, the exception SparkException: The Spark SQL phase optimization failed with an internal error. is raised. In addition to making the Python string raw, we also need to make the strings in the Spark SQL query raw, which can again be accomplished by providing the r prefix (see here) to the strings used within the query, so that backslashes are escaped correctly by the Spark SQL parser.
It's a bit odd looking, but the end result is:
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
And the output is, as expected:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
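Since the data in the question lives in table columns rather than string literals, the same raw-string approach applies to a column as well. A minimal sketch, with df, temp, and t as hypothetical names, assuming a Spark version that supports raw string literals in SQL:
df = spark.createDataFrame([(r'ABC\123XYZ\456',)], ['temp'])
df.createOrReplaceTempView('t')
spark.sql(r"select regexp_replace(temp, r'[\\][\d][\d][\d]', '') as new_value from t").show()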
It's also worth mentioning that if you call regexp_replace as a function directly (from pyspark.sql.functions import regexp_replace), rather than within a Spark SQL query, the pattern seems to be implicitly treated as a raw string. It does not require an r prefix - the backslash in the first character class just needs to be escaped:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

# The column value contains literal backslashes: ABC\123XYZ\456
df = spark.createDataFrame([(r'ABC\123XYZ\456',)], ['temp'])
# After Python's escape handling, the pattern string is [\\][\d][\d][\d]
df.select(regexp_replace('temp', '[\\\][\d][\d][\d]', '').alias('new_value')).show()
Output:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
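The same call can also be written with a raw Python string for the pattern, which avoids relying on \] and \d slipping through as unrecognized escape sequences (newer Python versions emit a DeprecationWarning for those):
df.select(regexp_replace('temp', r'[\\][\d][\d][\d]', '').alias('new_value')).show()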
It looks like when running on EMR, raw strings aren't supported in Spark SQL, which manifests as parsing errors. In that case, making the Python string a raw string and adding escape characters to the SQL string should also work:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

# Raw Python string; the SQL parser then reduces \\ to \ in the data
# and \\\\ to \\ in the pattern, which the regex engine reads as a literal \
spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()
Output:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
Note that the string the regex is being applied to only requires a single escape (\) for each \, whereas the pattern requires two escape characters (\\) for each \.
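That two-layer escaping can be sanity-checked in plain Python, using the re module as a stand-in for Spark's Java regex engine - one escape level gets the backslash through the string parser, and the pattern needs a second level for the regex engine:
>>> import re
>>> re.sub('[\\\\][\\d][\\d][\\d]', '', 'ABC\\123XYZ\\456')
'ABCXYZ'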
Upvotes: 1