Reputation: 103
My company is migrating from MapR to Databricks, and the following piece of code used to work fine on the old platform, but on Databricks it stopped working. I noticed the failure is tied to this specific regular expression, because I am not getting any error with other patterns.
The error is "Error while obtaining a new communication channel", and after it occurs we cannot continue writing and testing code; something breaks.
I am attaching a screenshot for reference.
import pyspark.sql.functions as pyfunc
df=spark.read.parquet("/mnt/gpdipedlstgamrasp50565/stg_db/intermediate/ODX/ODW/STUDY_REPORT/Current/Data/")
df.count()
df = df.withColumn('CSR_RESULTS_SUMMARY', pyfunc.regexp_replace(pyfunc.col('CSR_RESULTS_SUMMARY'),u'([\ud800-\udfff\ufdd0-\ufdef\ufffe-\uffff+])',''))
df.show()
Thank you very much in advance.
Upvotes: 0
Views: 2200
Reputation: 32700
I suspect the error is caused by the u'' prefix on the regex pattern you pass to regexp_replace. In a u'' (or plain) string literal, Python interprets each \uXXXX escape into an actual code point, so your pattern ends up containing lone surrogates (U+D800 through U+DFFF), which cannot be encoded as UTF-8 when the pattern is shipped to the JVM. Use a Python raw string r'' instead, so the escape sequences are passed through literally to the Java regex engine (note also that the stray + belongs outside the character class, not inside it):
df = df.withColumn(
'column',
pyfunc.regexp_replace(pyfunc.col('column'), r'[\ud800-\udfff\ufdd0-\ufdef\ufffe-\uffff]+', '')
)
Or escape each backslash with \\ so the escapes survive in a regular string:
df = df.withColumn(
'column',
pyfunc.regexp_replace(pyfunc.col('column'), '[\\ud800-\\udfff\\ufdd0-\\ufdef\\ufffe-\\uffff]+', '')
)
Upvotes: 1