user2335580

Reputation: 408

Unable to get result of regex expression in pyspark dataframe

I am using Pyspark in Databricks with Spark 3.1.

I need to extract numbers from a text column in a dataframe using the regexp_extract_all function.

Approach 1:

email_df11 = spark.sql("select New_id, regexp_extract_all(subject,'(?<!^DT!\\d)([D|d][T|t]\\d{12}|\\d{9,29})(?!\\d)', 1) as num_subject  from email_view")

This results in a num_subject column containing only empty lists, i.e. no matches are returned.

However, when I run the query below directly against a view of the same dataframe, I can see the expected output.

Approach 2:

select New_id, regexp_extract_all(subject,'same regex as above', 1) as num_subject from email_view 


What do I need to change in Approach 1 in order to get a similar result?

Upvotes: 0

Views: 1039

Answers (1)

blackbishop

Reputation: 32650

You need to use four backslashes (\\\\) for each backslash in the regex when the query is passed to spark.sql as a regular Python string:

email_df11 = spark.sql("select New_id, regexp_extract_all(subject,'(?<!^DT!\\\\d)([D|d][T|t]\\\\d{12}|\\\\d{9,29})(?!\\\\d)', 1) as num_subject  from email_view")

Or use a Python raw string for the query:

email_df11 = spark.sql(r"select New_id, regexp_extract_all(subject,'(?<!^DT!\\d)([D|d][T|t]\\d{12}|\\d{9,29})(?!\\d)', 1) as num_subject  from email_view")

Upvotes: 2
