Reputation: 65
I have a Oracle query which needs to be converted to Impala. I know that Impala has regexp_extract to return the string based on the regular expression that I provide. What my concern is if there is more that one occurance of the same string how do I capture that?
Let's say the dummy Oracle code I have:
Select t1.r1, REGEXP_SUBSTR("RMG123/RMG987",'(RMG\d{3})+',1,1) as r2, REGEXP_SUBSTR("RMG123/RMG987",'(RMG\d{3})+',1,2) as r3 From t1;
Here I will get value of r2 and r3 as RMG123 and RMG987 respectively.
When I converted it into Impala equivalent as
Select t1.r1, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',1) as r2, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',2) as r3 From t1;
I got the value for r2 as RMG123 but didn't get any value for r3 as regexp_extract is not allowing to check for second occurance of the pattern.
Note that the data RMG123/RMH987 is just a sample data. The user doesn't know that these two field are seperated by /.
Please suggest a way in Impala where I can achieve the result as same as in Oracle.
Upvotes: 1
Views: 1248
Reputation: 38325
In Impala regexp_extract the last parameter is a group () number in a pattern, not n-th occurence number as in Oracle regesp_substr. Your pattern contains single group number 1, no group 2. And if you want to extract 2nd occurence of substring, change the pattern for example like this:
regexp_extract("RMG123/RMG987",'(RMG\d{3})+.*?(RMG\d{3})',2)
Pattern '(RMG\\d{3})+.*?(RMG\\d{3})'
means:
(RMG\\d{3})+
- first group 1+ times. + here means that two or more pattern occurrences in a row will be considered as single one.
.*?
- some delimiters any times non-greedy
(RMG\\d{3})
- Second group - this is second occurrence of the pattern you want to extract.
+
sign after first group in the pattern is significant here because without it, multiple occurrences of the group without any delimiters will be considered as new occurrence, with + sign, multiple occurrences will be considered as the single one.
For example if initial string is RMG123RMG980/RMG987
,
regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3})+.*?(RMG\\d{3})',2)
will produce RMG987
And the same pattern without +
regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3}).*?(RMG\\d{3})',2)
will produce RMG980
Unfortunately I have no Impala to test it, the same works in Hive, Impala regex flavor may be a bit different.
Upvotes: 1