RnR
RnR

Reputation: 2115

Get all instances of a substring in a string in PostgreSQL?

I've got a table with entries that are similar to "something XXX010101 somethingelse XXX010102".

I'd like to extract the XXX01... parts from them - possibly using some separator if needed - right now I'm able to easily take out the first one using:

select distinct substring(content from 'XXX[0-9]+'), content from data where content ~ 'XXX[0-9]+'

The one idea I have is to create some monster regex to replace everything that's not an XXX substring... but to be true I hoped (and failed to find) for a simpler solution (like a g flag to substring).

When looking around I found that 8.3 introduced a regex_matches function that seems to be what I'd need - are there any alternatives in 8.2? Or an easy way to get it in 8.2?

How would you go about solving this? Or is an upgrade my best bet? ;) (it's a production system so the downtime and some risk of problems after migration is a source of hesitation ).

Thanks in advance.

-- expected output added --

for

"something XXX010101 something else XXX010102"

I'd like to get:

XXX010101
XXX010102

OR (less preferable)

XXX010101,XXX010102 

Upvotes: 4

Views: 4512

Answers (4)

Vérace
Vérace

Reputation: 908

Late to the party, but (code below is available here):

CREATE TABLE tab
(
  s_id INT  NOT NULL GENERATED ALWAYS AS IDENTITY,
  str  TEXT NOT NULL
);

Populate:

INSERT INTO tab (str) VALUES
 ('no match')
 ,('XXX010101')
 ,('XXX010101 XXX020202')
 ,('XXX010102 beginn')
 ,('end XXX010101')
 ,('middle XXX010101 match')
 ,('match middle XXX010102 and end XXX010102')
 ,('XXX010107, match beginn XXX010108 & middle')
 ,('XXX010109 begin and end XXX010110')
 ,('XXX01011 begin XXX010112 middle and end XXX010113')
 ,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
 ,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end');

and then we run:

SELECT
  s_id,
  UNNEST(REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g')),
  PG_TYPEOF(UNNEST(REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g')))  -- not required
                                                             -- for testing
  str
FROM
  tab
ORDER BY s_id;

Result:

s_id    unnest  str
2   XXX010101   text
3   XXX010101   text
3   XXX020202   text
4   XXX010102   text
5   XXX010101   text
6   XXX010101   text
7   XXX010102   text
7   XXX010102   text
...
... snipped for brevity
...
12  XXX010118   text
12  XXX010119   text
12  XXX010120   text
12  XXX010121   text

You can also use (probably more performant - see fiddle):

  (REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g'))[1],

If you have a .csv string, you can turn it into the result above using REGEXP_SPLIT_TO_TABLE().

SELECT
  REGEXP_SPLIT_TO_TABLE('XXX01017,XXX010118,XXX010119,XXX010120,XXX010121', ',');

Result:

regexp_split_to_table
XXX01017
XXX010118
XXX010119
XXX010120
XXX010121

Upvotes: 0

Evan Carroll
Evan Carroll

Reputation: 1

The fastest method is to use plperlu which works with 8.2.

CREATE LANGUAGE plperl

CREATE FUNCTION get_things(inputStr text)
RETURNS SETOF text
AS $BODY$
  return \@{[ $_[0] =~ m/(XXX\d{6})/g ]} 
$BODY$
LANGUAGE plperl
IMMUTABLE;

SELECT get_things(x)
FROM ( VALUES
  ('XXX010101 somethingelse XXX010102')
) AS t(x);

 get_things 
------------
 XXX010101
 XXX010102
(2 rows)

It also works with new versions of PostgreSQL.

Upvotes: 1

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656251

I am somewhat hesitant to even post my answer. You really must upgrade. Version 8.2 reaches end of live right now. Follow the link @a_horse_with_no_name posted.

However, the problem got my attention. The following demo should work with PostgreSQL 8.2:

SELECT -- content,
         rtrim(
         regexp_replace(
         regexp_replace(
         replace(
         substring(content
        ,E'(XXX\\d+|XXX\\d+.*XXX\\d+)')  -- kill leading / trailing noise
        ,',',' ')                        -- remove all ","
        ,E'(XXX\\d+)', E'\\1,', 'g')     -- terminate X-strings with ","
        -- now we can use non-greedy terminated with ","
        ,E'(XXX\\d+?,)*?.*?(XXX\\d+?,)', E'\\1\\2', 'g')
        ,',') AS result
FROM    (VALUES
  ('no match')
 ,('XXX010101')
 ,('XXX010102 beginn')
 ,('end XXX010103')
 ,('middle XXX010104 match')
 ,('match middle XXX010105 and end XXX010106')
 ,('XXX010107, match beginn XXX010108 & middle')
 ,('XXX010109 begin and end XXX010110')
 ,('XXX01011 begin XXX010112 middle and end XXX010113')
 ,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
 ,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end')
) data(content)

Result:

                     result
--------------------------------------------------
             -- first line is NULL
 XXX010101
 XXX010102
 XXX010103
 XXX010104
 XXX010105,XXX010106
 XXX010107,XXX010108
 XXX010109,XXX010110
 XXX01011,XXX010112,XXX010113
 XXX01014,XXX010115,XXX010116
 XXX01017,XXX010118,XXX010119,XXX010120,XXX010121

Some explanation:

  • no regex_matches() in version 8.2 as OP already stated
  • but there is regexp_replace() which can use a g flag (replace globally)
  • we cannot mix greedy and non-greedy quantifiers in the same regular expression
  • so I terminate the wanted strings with a ,, after removing all other occurances of , Could be any character not part of the wanted strings, but , can serve as separator in the result.
  • first cut leading and trailing noise
  • then globally replace everything in between wanted strings.
  • to make that work use (XXX\\d+?,)*? to capture any number of leading occurances of wanted string.
  • final rtrim() removes trailing ,

  • In PsotgreSQL 8.3+ you could use regexp_split_to_table() to split the wanted strings into single rows. In 8.2 you'll have to come up with something of your own. I would write a plgpsql function ...

This makes heavy use of a features of PostgreSQL's POSIX Regular Expressions (link to version 8.2!)

Upvotes: 3

xQbert
xQbert

Reputation: 35323

How about something like this (assuming the values you're looking for are contained in a separate table)... I'll make no claims as to good performance though...

Select A.text, B.Text2, B.Val 
FROM A 
INNER JOIN B ON B.Text2 LIKE ('%' || A.Text || '%')

let table A contain all possible combinations of XXX010101... that your looking for

text
XXX010101
XXX010102
XXX010103

let table B contain all the search texts you want

text 2                                              val
something XXX010101 something else XXX010102        1
yet another XXX010102 and this XXX010103            2
XXX010105                                           3
XXX010103                                           4

Results:

text            text2                                           VAL
XXX010101   something XXX010101 something else XXX010102    1
XXX010102   something XXX010101 something else XXX010102    1
XXX010102   yet another XXX010102 and this XXX010103        2
XXX010103   yet another XXX010102 and this XXX010103        2
XXX010103   XXX010103                                       4

--------WRONG but misunderstood the question...

What's wrong with the replace function? http://www.postgresql.org/docs/8.2/interactive/functions-string.html

replace( 'abcdefabcdef', 'cd', '')

so the characters are set to empty string.

Upvotes: 1

Related Questions