Sifu
Sifu

Reputation: 65

Impala equivalent to regexp_substr

I have a Oracle query which needs to be converted to Impala. I know that Impala has regexp_extract to return the string based on the regular expression that I provide. What my concern is if there is more that one occurance of the same string how do I capture that?

Let's say the dummy Oracle code I have: Select t1.r1, REGEXP_SUBSTR("RMG123/RMG987",'(RMG\d{3})+',1,1) as r2, REGEXP_SUBSTR("RMG123/RMG987",'(RMG\d{3})+',1,2) as r3 From t1;

Here I will get value of r2 and r3 as RMG123 and RMG987 respectively.

When I converted it into Impala equivalent as Select t1.r1, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',1) as r2, regexp_extract("RMG123/RMG987",'(RMG\\d{3})+',2) as r3 From t1;

I got the value for r2 as RMG123 but didn't get any value for r3 as regexp_extract is not allowing to check for second occurance of the pattern.

Note that the data RMG123/RMH987 is just a sample data. The user doesn't know that these two field are seperated by /.

Please suggest a way in Impala where I can achieve the result as same as in Oracle.

Upvotes: 1

Views: 1248

Answers (1)

leftjoin
leftjoin

Reputation: 38325

In Impala regexp_extract the last parameter is a group () number in a pattern, not n-th occurence number as in Oracle regesp_substr. Your pattern contains single group number 1, no group 2. And if you want to extract 2nd occurence of substring, change the pattern for example like this:

regexp_extract("RMG123/RMG987",'(RMG\d{3})+.*?(RMG\d{3})',2)

Pattern '(RMG\\d{3})+.*?(RMG\\d{3})' means:

(RMG\\d{3})+ - first group 1+ times. + here means that two or more pattern occurrences in a row will be considered as single one.

.*? - some delimiters any times non-greedy

(RMG\\d{3}) - Second group - this is second occurrence of the pattern you want to extract.

+ sign after first group in the pattern is significant here because without it, multiple occurrences of the group without any delimiters will be considered as new occurrence, with + sign, multiple occurrences will be considered as the single one.

For example if initial string is RMG123RMG980/RMG987,

regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3})+.*?(RMG\\d{3})',2)

will produce RMG987

And the same pattern without +

 regexp_extract("RMG123RMG980/RMG987",'(RMG\\d{3}).*?(RMG\\d{3})',2)

will produce RMG980

Unfortunately I have no Impala to test it, the same works in Hive, Impala regex flavor may be a bit different.

Upvotes: 1

Related Questions