Aleksander Lipka

Reputation: 528

Spark sql regexp_extract selecting digits error

Hello people of StackOverflow
I'm trying to extract the numeric part of a house number, but for some reason I cannot.
I have working code from Teradata that I'm trying to convert to PySpark:

--Original code from teradata:
CAST(REGEXP_SUBSTR(adr_house_no, '\d+') AS INTEGER) AS adr_house_no, 
REGEXP_SUBSTR(adr_house_no, '[A-Za-z]+$') AS adr_house_no_ad

Here is the query I'm using:

result = spark.sql('''

    SELECT
        adr_house_no as house_no,
        CAST(regexp_extract(adr_house_no, '(\d+)') AS INT) as adr_house_no,
        regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
    FROM subscriber_info_address_subscriber

    ''').show()

The result is as follows:

+--------+------------+---------------+
|house_no|adr_house_no|adr_house_no_ad|
+--------+------------+---------------+
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
|     51l|        null|              l|
|     84J|        null|              J|
|     96t|        null|              t|
|     919|        null|               |
|     59e|        null|              e|
|     919|        null|               |
| LTECXYD|        null|        LTECXYD|
|     67s|        null|              s|
|     4-6|        null|               |
|     361|        null|               |
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
| LTECXYD|        null|        LTECXYD|
|     842|        null|               |
| LTECXYD|        null|        LTECXYD|
|     98r|        null|              r|
|     361|        null|               |
| LTECXYD|        null|        LTECXYD|
+--------+------------+---------------+

The part that extracts the house letter works, but for some reason I cannot match any digit. I tried selecting one digit with \d as well as several with \d+.
I also tried regexp_extract(adr_house_no, '\d+') without parentheses, but it doesn't work either.
What does work is regexp_extract(adr_house_no, '[0-9]+').
Why is that? Why doesn't \d work in PySpark?

Upvotes: 4

Views: 9051

Answers (3)

yyyyyyyy

Reputation: 61

Actually, \d is supported in the SQL form as well, but you have to prefix the Python string with r and double the backslash, for example:

result = spark.sql(r'''

    SELECT
        adr_house_no as house_no,
        CAST(regexp_extract(adr_house_no, '(\\d+)') AS INT) as adr_house_no,
        regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
    FROM subscriber_info_address_subscriber

    ''').show()
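The reason the r prefix and the doubled backslash are both needed is that the pattern passes through two escape layers before the Java regex engine ever sees it. A minimal pure-Python sketch of those layers (the `.replace` call below only mimics Spark SQL's string-literal parsing, it is not a Spark API):

```python
# The pattern travels through two escape layers:
# 1. the Python string literal, 2. Spark SQL's string-literal parser.
py_literal = r'\\d+'                       # what you type inside spark.sql(r'''...''')
assert list(py_literal) == ['\\', '\\', 'd', '+']

# Spark SQL's parser collapses the doubled backslash into one
# (mimicked here with a plain str.replace):
sql_pattern = py_literal.replace('\\\\', '\\')
assert sql_pattern == r'\d+'               # the Java regex engine receives \d+
```

Without the r prefix or the doubling, only a bare `d` survives both layers, which is why the original query matched nothing.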

Upvotes: 6

Karthik

Reputation: 1171

Hi, I have figured out the problem.

Since you are writing the pattern inside a Spark SQL string literal, the backslash in \d is treated as an escape character and stripped before the pattern reaches the regex engine, so it never matches a digit. Write the pattern as '[0-9]+' instead, which contains no backslash and matches any digit.

In your case replace as below:

spark.sql('''
    SELECT
        adr_house_no as house_no,
        CAST(regexp_extract(adr_house_no, '([0-9]+)', 1) AS INT) as adr_house_no,
        regexp_extract(adr_house_no, '([A-Za-z]+$)', 1) as adr_house_no_ad
    FROM subscriber_info_address_subscriber
    ''').show()
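You can sanity-check both patterns outside Spark first. Python's re module isn't the Java regex engine Spark uses, but for character classes and anchors like these the behavior is the same, so it works as a quick analogue:

```python
import re

# '84J' is one of the sample house numbers from the question's output.
assert re.search('([0-9]+)', '84J').group(1) == '84'      # digits
assert re.search('([A-Za-z]+$)', '84J').group(1) == 'J'   # trailing letters
```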

Alternatively, if you want to use Python-level regular expression strings, write the code with the DataFrame API as below, and it will work:

from pyspark.sql.functions import regexp_extract

df.withColumn('house_no', regexp_extract('adr_house_no', r'(\d+)', 1).cast('int')) \
  .withColumn('adr_house_no_ad', regexp_extract('adr_house_no', '([A-Za-z]+$)', 1)) \
  .show()

Upvotes: 1

Karthik

Reputation: 1171

In a regular expression, parentheses indicate grouping, so you also have to pass the number of the group you want to extract. Group numbering starts at one: if your pattern contains 3 groups and you need the 2nd one, pass 2.

In your case there is one group and you need that one, so write regexp_extract('adr_house_no', '(\d+)', 1).

Also note the syntax: regexp_extract(str, pattern, idx) extracts the specific group idx, identified by a Java regex, from the specified string column.
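The group-numbering rule above can be illustrated with Python's re module (Spark uses Java regex, but group numbering works identically in both engines):

```python
import re

# Two capture groups: group 1 = digits, group 2 = trailing letters.
m = re.search(r'(\d+)([A-Za-z]+)$', '84J')
assert m.group(1) == '84'   # first group: the house number digits
assert m.group(2) == 'J'    # second group: the letter suffix
```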

Upvotes: 0
