Reputation: 408
I am using PySpark in Databricks with Spark 3.1. I need to extract numbers from a text column in a DataFrame using the regexp_extract_all function.
Approach 1:
email_df11 = spark.sql("select New_id, regexp_extract_all(subject,'(?<!^DT!\\d)([D|d][T|t]\\d{12}|\\d{9,29})(?!\\d)', 1) as num_subject from email_view")
This results in a num_subject column containing only empty lists.
However, when I create a view over the same DataFrame and run the query below directly in SQL, I do see the expected output.
Approach 2:
select New_id, regexp_extract_all(subject,'same regex as above', 1) as num_subject from email_view
What do I need to change in Approach 1 in order to get the same result?
Upvotes: 0
Views: 1039
Reputation: 32650
You need to use four backslashes (\\\\) for each backslash when passing the query through spark.sql, because Python string parsing consumes one level of escaping and the Spark SQL parser consumes another:
email_df11 = spark.sql("select New_id, regexp_extract_all(subject,'(?<!^DT!\\\\d)([D|d][T|t]\\\\d{12}|\\\\d{9,29})(?!\\\\d)', 1) as num_subject from email_view")
Or use a Python raw string for the query, so Python leaves the backslashes untouched:
email_df11 = spark.sql(r"select New_id, regexp_extract_all(subject,'(?<!^DT!\\d)([D|d][T|t]\\d{12}|\\d{9,29})(?!\\d)', 1) as num_subject from email_view")
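To see why the two forms are equivalent, here is a small pure-Python sketch (no Spark needed) of how the escaping layers collapse. Each layer of parsing strips one level of backslashes: Python source → Python string → Spark SQL string literal → regex engine.

```python
# Python source -> Python string -> Spark SQL string literal -> regex engine.
# Each parser consumes one level of backslash escaping.

four = "\\\\d"   # plain literal: Python collapses \\\\ to \\, string holds two backslashes + 'd'
raw  = r"\\d"    # raw literal: backslashes kept verbatim, same three characters

assert four == raw            # both hand Spark the SQL literal \\d
assert len(four) == 3         # '\', '\', 'd'

# Spark's SQL parser then collapses \\ to \, so the regex engine
# finally receives \d -- the intended digit-class escape.
# With only "\\d" in the Python source, the regex engine would have
# received a bare d, which is why Approach 1 matched nothing.
```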
Upvotes: 2