Reputation: 2115
I've got a table with entries that are similar to "something XXX010101 somethingelse XXX010102".
I'd like to extract the XXX01... parts from them - possibly using some separator if needed - right now I'm able to easily take out the first one using:
select distinct substring(content from 'XXX[0-9]+'), content from data where content ~ 'XXX[0-9]+'
The one idea I have is to create some monster regex to replace everything that's not an XXX substring... but to be true I hoped (and failed to find) for a simpler solution (like a g flag to substring).
When looking around I found that 8.3 introduced a regex_matches function that seems to be what I'd need - are there any alternatives in 8.2? Or an easy way to get it in 8.2?
How would you go about solving this? Or is an upgrade my best bet? ;) (it's a production system so the downtime and some risk of problems after migration is a source of hesitation ).
Thanks in advance.
-- expected output added --
for
"something XXX010101 something else XXX010102"
I'd like to get:
XXX010101
XXX010102
OR (less preferable)
XXX010101,XXX010102
Upvotes: 4
Views: 4512
Reputation: 908
Late to the party, but (code below is available here):
CREATE TABLE tab
(
s_id INT NOT NULL GENERATED ALWAYS AS IDENTITY,
str TEXT NOT NULL
);
Populate:
INSERT INTO tab (str) VALUES
('no match')
,('XXX010101')
,('XXX010101 XXX020202')
,('XXX010102 beginn')
,('end XXX010101')
,('middle XXX010101 match')
,('match middle XXX010102 and end XXX010102')
,('XXX010107, match beginn XXX010108 & middle')
,('XXX010109 begin and end XXX010110')
,('XXX01011 begin XXX010112 middle and end XXX010113')
,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end');
and then we run:
SELECT
s_id,
UNNEST(REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g')),
PG_TYPEOF(UNNEST(REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g'))) -- not required
-- for testing
str
FROM
tab
ORDER BY s_id;
Result:
s_id unnest str
2 XXX010101 text
3 XXX010101 text
3 XXX020202 text
4 XXX010102 text
5 XXX010101 text
6 XXX010101 text
7 XXX010102 text
7 XXX010102 text
...
... snipped for brevity
...
12 XXX010118 text
12 XXX010119 text
12 XXX010120 text
12 XXX010121 text
You can also use (probably more performant - see fiddle):
(REGEXP_MATCHES(str, 'XXX\d{4,6}', 'g'))[1],
If you have a .csv string, you can turn it into the result above using REGEXP_SPLIT_TO_TABLE().
SELECT
REGEXP_SPLIT_TO_TABLE('XXX01017,XXX010118,XXX010119,XXX010120,XXX010121', ',');
Result:
regexp_split_to_table
XXX01017
XXX010118
XXX010119
XXX010120
XXX010121
Upvotes: 0
Reputation: 1
The fastest method is to use plperlu which works with 8.2.
CREATE LANGUAGE plperl
CREATE FUNCTION get_things(inputStr text)
RETURNS SETOF text
AS $BODY$
return \@{[ $_[0] =~ m/(XXX\d{6})/g ]}
$BODY$
LANGUAGE plperl
IMMUTABLE;
SELECT get_things(x)
FROM ( VALUES
('XXX010101 somethingelse XXX010102')
) AS t(x);
get_things
------------
XXX010101
XXX010102
(2 rows)
It also works with new versions of PostgreSQL.
Upvotes: 1
Reputation: 656251
I am somewhat hesitant to even post my answer. You really must upgrade. Version 8.2 reaches end of live right now. Follow the link @a_horse_with_no_name posted.
However, the problem got my attention. The following demo should work with PostgreSQL 8.2:
SELECT -- content,
rtrim(
regexp_replace(
regexp_replace(
replace(
substring(content
,E'(XXX\\d+|XXX\\d+.*XXX\\d+)') -- kill leading / trailing noise
,',',' ') -- remove all ","
,E'(XXX\\d+)', E'\\1,', 'g') -- terminate X-strings with ","
-- now we can use non-greedy terminated with ","
,E'(XXX\\d+?,)*?.*?(XXX\\d+?,)', E'\\1\\2', 'g')
,',') AS result
FROM (VALUES
('no match')
,('XXX010101')
,('XXX010102 beginn')
,('end XXX010103')
,('middle XXX010104 match')
,('match middle XXX010105 and end XXX010106')
,('XXX010107, match beginn XXX010108 & middle')
,('XXX010109 begin and end XXX010110')
,('XXX01011 begin XXX010112 middle and end XXX010113')
,('XXX01014 begin XXX010115 close matches XX010113 XXXy010113 23624 ,XXX010116')
,('f XXX01017 B XXX010118 457XXX010119 XXXy XXX010120 overkill XXX010121end')
) data(content)
Result:
result
--------------------------------------------------
-- first line is NULL
XXX010101
XXX010102
XXX010103
XXX010104
XXX010105,XXX010106
XXX010107,XXX010108
XXX010109,XXX010110
XXX01011,XXX010112,XXX010113
XXX01014,XXX010115,XXX010116
XXX01017,XXX010118,XXX010119,XXX010120,XXX010121
regex_matches()
in version 8.2 as OP already statedregexp_replace()
which can use a g
flag (replace g
lobally),
, after removing all other occurances of ,
Could be any character not part of the wanted strings, but ,
can serve as separator in the result.(XXX\\d+?,)*?
to capture any number of leading occurances of wanted string.final rtrim()
removes trailing ,
In PsotgreSQL 8.3+ you could use regexp_split_to_table()
to split the wanted strings into single rows. In 8.2 you'll have to come up with something of your own. I would write a plgpsql function ...
This makes heavy use of a features of PostgreSQL's POSIX Regular Expressions (link to version 8.2!)
Upvotes: 3
Reputation: 35323
How about something like this (assuming the values you're looking for are contained in a separate table)... I'll make no claims as to good performance though...
Select A.text, B.Text2, B.Val
FROM A
INNER JOIN B ON B.Text2 LIKE ('%' || A.Text || '%')
let table A contain all possible combinations of XXX010101... that your looking for
text
XXX010101
XXX010102
XXX010103
let table B contain all the search texts you want
text 2 val
something XXX010101 something else XXX010102 1
yet another XXX010102 and this XXX010103 2
XXX010105 3
XXX010103 4
Results:
text text2 VAL
XXX010101 something XXX010101 something else XXX010102 1
XXX010102 something XXX010101 something else XXX010102 1
XXX010102 yet another XXX010102 and this XXX010103 2
XXX010103 yet another XXX010102 and this XXX010103 2
XXX010103 XXX010103 4
--------WRONG but misunderstood the question...
What's wrong with the replace function? http://www.postgresql.org/docs/8.2/interactive/functions-string.html
replace( 'abcdefabcdef', 'cd', '')
so the characters are set to empty string.
Upvotes: 1