Reputation: 419
I have written an SQL in Athena, that uses the regex_extract to extract substring from a column, it extracts string, where there is "X10003" and takes up to when the space appears. for example from 5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP)
it extracts X10003-10447-XXX-20443
and it works fine using REGEXP_EXTRACT(site, 'X10033.*?\w+-\d+')
in Athena.
Now I am converting this to sparkSQL and it does not work properly, I have applied different ways but nothing works, for example I have used expression:
Regexp_extract("5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP)", 'X10003([^\s]+)')
gives me -10447-XXX-20443 (CAMP)
which is not what I require.
Can anyone tells me what I am doing wrong?
Upvotes: 2
Views: 2723
Reputation: 626929
You can use
REGEXP_EXTRACT(site, '(X10003.*?[A-Za-z0-9]-[0-9]+)', 1)
See the regex demo.
The (X10003.*?[A-Za-z0-9]-[0-9]+)
pattern is a capturing group with ID 1 that matches
X10003
- a literal string.*?
- zero or more chars other than line break chars, as few as possible[A-Za-z0-9]
- an alphanumeric char-
- a hyphen[0-9]+
- one or more digits.Upvotes: 1
Reputation: 91430
Enlarge the capture group:
Regexp_extract("5570 - Site 811111 - X10003-10447-XXX-20443 (CAMP)", '(X10003[^\s]+)')
# here ___^
Upvotes: 0