Reputation: 528
Hello, people of Stack Overflow.
I'm trying to extract the number from a house number field, but for some reason I cannot.
I have working code in Teradata that I'm trying to convert to PySpark:
-- Original code from Teradata:
CAST(REGEXP_SUBSTR(adr_house_no, '\d+') AS INTEGER) AS adr_house_no,
REGEXP_SUBSTR(adr_house_no, '[A-Za-z]+$') AS adr_house_no_ad
Here is the query I'm using:
result = spark.sql('''
SELECT
adr_house_no as house_no,
CAST(regexp_extract(adr_house_no, '(\d+)') AS INT) as adr_house_no,
regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
FROM subscriber_info_address_subscriber
''').show()
The result is as follows:
+--------+------------+---------------+
|house_no|adr_house_no|adr_house_no_ad|
+--------+------------+---------------+
| LTECXYD| null| LTECXYD|
| LTECXYD| null| LTECXYD|
| 51l| null| l|
| 84J| null| J|
| 96t| null| t|
| 919| null| |
| 59e| null| e|
| 919| null| |
| LTECXYD| null| LTECXYD|
| 67s| null| s|
| 4-6| null| |
| 361| null| |
| LTECXYD| null| LTECXYD|
| LTECXYD| null| LTECXYD|
| LTECXYD| null| LTECXYD|
| 842| null| |
| LTECXYD| null| LTECXYD|
| 98r| null| r|
| 361| null| |
| LTECXYD| null| LTECXYD|
+--------+------------+---------------+
The part that extracts the house letter works, but for some reason I cannot match any digit. I tried selecting one digit with \d or two.
I tried regexp_extract(adr_house_no, '\d+') without parentheses, but it also doesn't work.
What does work is regexp_extract(adr_house_no, '[0-9]+').
Why is that? Why doesn't \d work in PySpark?
Upvotes: 4
Views: 9051
Reputation: 61
Actually, \d is supported in the SQL form, but it requires adding an r prefix before the Python string and using a double backslash, for example:
result = spark.sql(r'''
SELECT
adr_house_no as house_no,
CAST(regexp_extract(adr_house_no, '(\\d+)') AS INT) as adr_house_no,
regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
FROM subscriber_info_address_subscriber
''').show()
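For reference, here is a minimal, self-contained sketch of the same idea; the sample rows and the temporary view are made up to stand in for the real table:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data standing in for subscriber_info_address_subscriber.
spark.createDataFrame(
    [('51l',), ('84J',), ('919',), ('LTECXYD',)],
    ['adr_house_no'],
).createOrReplaceTempView('subscriber_info_address_subscriber')

# With the r prefix and '\\d', Spark SQL's string parser hands \d to the regex engine,
# so the digits are extracted and the CAST no longer yields null.
spark.sql(r'''
SELECT
adr_house_no as house_no,
CAST(regexp_extract(adr_house_no, '(\\d+)') AS INT) as adr_house_no,
regexp_extract(adr_house_no, '([A-Za-z]+$)') as adr_house_no_ad
FROM subscriber_info_address_subscriber
''').show()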
Upvotes: 6
Reputation: 1171
Hi, I have figured out the problem.
Since you are writing the pattern inside a Spark SQL string, \d is not passed through to the regex engine as-is, so it doesn't match anything. Write '[0-9]+' instead to match any digit.
In your case, replace it as below:
spark.sql("SELECT adr_house_no as house_no, CAST(regexp_extract(adr_house_no, '([0-9]+)',1) AS INT) as adr_house_no, regexp_extract(adr_house_no, '([A-Za-z]+$)',1) as adr_house_no_ad FROM subscriber_info_address_subscriber").show()
Alternatively, if you want to use the regular expression directly, you can write your code with the DataFrame API as below, and then it will work:
from pyspark.sql.functions import regexp_extract

df.withColumn('house_no', regexp_extract('adr_house_no', '(\d+)', 1).cast('int')) \
    .withColumn('adr_house_no_ad', regexp_extract('adr_house_no', '([A-Za-z]+$)', 1)) \
    .show()
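One reason the DataFrame form works: the pattern string is handed straight to the JVM regex engine without going through Spark SQL's string-literal parsing, so \d needs no extra escaping there. As an optional variation on the snippet above, a raw Python string also avoids Python's warning about \d being an invalid escape sequence in an ordinary string literal:
# Same extraction as above, but with raw-string patterns so Python does not
# flag \d as an invalid escape sequence in the string literal.
df.withColumn('house_no', regexp_extract('adr_house_no', r'(\d+)', 1).cast('int')) \
    .withColumn('adr_house_no_ad', regexp_extract('adr_house_no', r'([A-Za-z]+$)', 1)) \
    .show()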
Upvotes: 1
Reputation: 1171
Since parentheses in a regular expression indicate grouping, you also have to mention the group number, i.e. the index of the group you want to extract. Group numbering starts from one. Suppose your pattern contains 3 groups and you need the 2nd one; then you pass 2.
In your case there is one group and you need that one, so write regexp_extract('adr_house_no', '(\d+)', 1).
Also note the syntax regexp_extract(str, pattern, idx): it extracts the specific group (idx) identified by a Java regex from the specified string column.
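For illustration, here is a minimal sketch with a made-up one-row DataFrame showing how idx selects the group:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract

spark = SparkSession.builder.getOrCreate()

# Made-up sample row; the pattern below has two groups, and idx picks which one comes back.
df = spark.createDataFrame([('51l',)], ['adr_house_no'])
df.select(
    regexp_extract('adr_house_no', r'(\d+)([A-Za-z]*)', 1).alias('digits'),   # group 1 -> '51'
    regexp_extract('adr_house_no', r'(\d+)([A-Za-z]*)', 2).alias('letters'),  # group 2 -> 'l'
).show()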
Upvotes: 0