How do I extract a number followed by a specific string in PostgreSQL?

Question

I have the following table:

CREATE TABLE test_regex (
    drug TEXT
);

INSERT INTO test_regex
VALUES
    ('DRUGA 200 MG'),
    ('DRUGB 150 MCG'),
    ('DRUGC 1.5 GM BOX'),
    ('DRUGD 27.2 MG/5 ML VIAL')
;

I want to extract the following:

200 MG
150 MCG
1.5 GM
27.2 MG

What I have tried so far:

SELECT
    substring(drug, '[0-9]*\.?[0-9]* MG|GM|MCG')
FROM
    test_regex
;

Which results in:

 200 MG
 MCG
 GM
 27.2 MG

I've also tried:

SELECT
    substring(drug, '[0-9]*\.?[0-9]* (MG|GM|MCG)')
FROM
    test_regex
;

With this result:

 MG
 MCG
 GM
 MG

I think the problem lies in how the trailing (MG|GM|MCG) group is handled but I couldn't find what I was looking for in the PostgreSQL Docs. I was expecting to get the first number section, then a space, then either MG, GM or MCG. I think its getting grouped as the number expression with MG, then either GM or MCG.

Wiktor Stribiżew · Accepted Answer

The main idea is that you need to group alternatives that should match at one and the same location in a string. Also, I suggest using word boundaries to match the strings as whole words.

Also, note that substring returns only that part of a match that is captured by the capturing group if there is any:

if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned.

So, the grouping construct you may use is a non-capturing group, (?:...|...).

You may use

substring(drug, '\m[0-9]*\.?[0-9]+\s*(?:MG|GM|MCG)\M')

See the online demo.

Pattern details

\m - start of a word
[0-9]* - zero or more digits
\.? - an optional .
[0-9]+ - 1+ digits
\s* - 0+ whitespaces
(?:MG|GM|MCG) - either MG, orGM or MCG (you may write it as (?:MC?G|GM) to make it more efficient)
\M - end of word.

How do I extract a number followed by a specific string in PostgreSQL?

Answers (1)

Related Questions