Raymond
Raymond

Reputation: 1

REGEXP_SUBSTR Redshift

I am trying to extract a substring from a text string in postgresql. The column name of the text string is short_description and I am using the REGEXP_SUBSTR function to define a regex that will return only the portion that I want:

SELECT  short_description, 
REGEXP_SUBSTR(short_description,'\\[[^=[]*') AS space
FROM my_table

This returns the following:

short_description                                        space
----------------------------------------------------------------------------
[ABC12][1][ABCDEFG] ACB DEF [HIJ]           |            [ABC12]

What I would like to pull is the following:

short_description                                        space
----------------------------------------------------------------------------
[ABC12][1][ABCDEFG] ACB DEF [HIJ]           |            [ABCDEFG]

Any ideas?

Upvotes: 0

Views: 7961

Answers (2)

Joe Harris
Joe Harris

Reputation: 14035

You can use the Regex character classes to help with this kind of match. Here I'm looking for letters only, surrounded by brackets, and a following space. Note the use of double backslash \\ to escape the literal brackets and the double brackets [[:a:]] for the character class

SELECT REGEXP_SUBSTR('[ABC12][1][ABCDEFG] ACB DEF [HIJ]','\\[[[:alpha:]]+\\] ');
 regexp_substr
---------------
 [ABCDEFG]

You could also use the SPLIT_PART function achieve something similar by splitting on a closing bracket ] and choosing the 3rd value.

SELECT SPLIT_PART('[ABC12][1][ABCDEFG] ACB DEF [HIJ]',']',3);
 split_part
------------
 [ABCDEFG

I recommend using the built in functions rather than a UDF if at all possible. UDFs are fantastic when you need them but they do incur a performance penalty.

Upvotes: 1

Jon Scott
Jon Scott

Reputation: 4354

Here you go. I have found the right regexp expression using https://txt2re.com

Then, I have implemented it as a python redshift UDF

create or replace function f_regex (input_str varchar(max),regex_expression varchar(max))
  returns VARCHAR(max)
stable
as $$
import re
rg = re.compile(regex_expression,re.IGNORECASE|re.DOTALL)
return rg.search(input_str).group(1)
$$ language plpythonu;

select f_regex('[ABC12][1][ABCDEFG] ACB DEF [HIJ] '::text,'.*?\\[.*?\\].*?\\[.*?\\](\\[.*?\\])'::text);

Once you have created the function, you can use it within any of your redshift selects.

So, in your case:

SELECT  short_description, 
f_regex(short_description::text,'.*?\\[.*?\\].*?\\[.*?\\](\\[.*?\\])'::text) AS space
FROM my_table

Upvotes: 0

Related Questions