Reputation: 3204
I have the following table:
CREATE TABLE test_regex (
drug TEXT
);
INSERT INTO test_regex
VALUES
('DRUGA 200 MG'),
('DRUGB 150 MCG'),
('DRUGC 1.5 GM BOX'),
('DRUGD 27.2 MG/5 ML VIAL')
;
I want to extract the following:
200 MG
150 MCG
1.5 GM
27.2 MG
What I have tried so far:
SELECT
substring(drug, '[0-9]*\.?[0-9]* MG|GM|MCG')
FROM
test_regex
;
Which results in:
200 MG
MCG
GM
27.2 MG
I've also tried:
SELECT
substring(drug, '[0-9]*\.?[0-9]* (MG|GM|MCG)')
FROM
test_regex
;
With this result:
MG
MCG
GM
MG
I think the problem lies in how the trailing (MG|GM|MCG)
group is handled but I couldn't find what I was looking for in the PostgreSQL Docs. I was expecting to get the first number section, then a space, then either MG, GM or MCG. I think its getting grouped as the number expression with MG, then either GM or MCG.
Upvotes: 4
Views: 2223
Reputation: 626845
The main idea is that you need to group alternatives that should match at one and the same location in a string. Also, I suggest using word boundaries to match the strings as whole words.
Also, note that substring
returns only that part of a match that is captured by the capturing group if there is any:
if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.
So, the grouping construct you may use is a non-capturing group, (?:...|...)
.
You may use
substring(drug, '\m[0-9]*\.?[0-9]+\s*(?:MG|GM|MCG)\M')
See the online demo.
Pattern details
\m
- start of a word[0-9]*
- zero or more digits\.?
- an optional .
[0-9]+
- 1+ digits\s*
- 0+ whitespaces(?:MG|GM|MCG)
- either MG
, orGM
or MCG
(you may write it as (?:MC?G|GM)
to make it more efficient)\M
- end of word.Upvotes: 3