HP.
HP.

Reputation: 19896

Regular expression to remove element not match specific prefix

I am doing this in Impala or Hive. Basically let say I have a string like this

f-150:aa|f-150:cc|g-210:dd

Each element is separated by the pipe |. Each has prefix f-150 or whatever. I want to be able to remove the prefix and keep only element that matches specific prefix. For example, if the prefix is f-150, I want the final string after regex_replace is

aa|cc

dd is removed because g-210 is different prefix and not match, therefore the whole element is removed.

Any idea how to do this using string expression in one SQL?

Thanks

UPDATE 1

I tried this in Impala:

select regexp_extract('f-150:aa|f-150:cc|g-210:dd','(?:(?:|(\\|))f-150|keep|those):|(?:^|\\|)\\w-\\d{3}:\\w{2}',0);

But got this output:

f-150:aa

In Hive, I got NULL.

Upvotes: 1

Views: 899

Answers (3)

The fourth bird
The fourth bird

Reputation: 163632

You could match the values that you want to remove and then replace with an empty string:

f-150:|\|[^:]+:[^|]+$|[^|]+:[^|]+\|

f-150:|\\|[^:]+:[^|]+$|[^|]+:[^|]+\\|

Explanation

  • f-150: Match literally
  • | Or
  • \|[^:]+:[^|]+$ Match a pipe, not a colon one or more times followed by not a pipe one or more times and assert the end of the line
  • | Or
  • [^|]+:[^|]+\| Match not a pipe one or more times, a colon followed by matching not a pipe one or more times and then match a pipe

Test with multiple lines and combinations

Upvotes: 1

Fact
Fact

Reputation: 2460

You may have to loop through the string until the end to get the all the matching sub string. Look ahead syntax is not supported in most sql so above regexp might not be suitable for SQL syntax. For you purpose you can do something like creating a table to loop through just to mimic Oracle's level syntax and join with your table containing the string.

With loop_tab as (
Select 1 loop union all
Select 2 union all
select 3 union all
select 4 union all
select 5),
string_tab as(Select 'f-150:aa|ade|f-150:ce|akg|f-150:bb|'::varchar(40) as str)
Select regexp_substr(str,'(f\\-150\\:\\w+\\|)',1,loop)
from string_tab
join loop_tab on 1=1

Output:

regexp_substr
f-150:aa|
f-150:ce|
f-150:bb|

Upvotes: 0

wp78de
wp78de

Reputation: 19000

The regexyou in question could look like this:

(?:(?:|(\\|))f-150|keep|those):|(?:^|\\|)\\w-\\d{3}:\\w{2}

I have added some pseudo keywords to retain, but I am sure you get the idea:

  • Wholy match elements that should be dropped but only match the prefix for those that should be retained.
  • To keep the separator intact, match | at the beginning of an element in group 1 and put it back in the replacement with $1.

Demo

According to the documentation, your query should be written like a Java regex; likewise, this should perform like this code sample in Java.

Upvotes: 1

Related Questions