Extracting a numerical value from a paragraph based on preceding words

Question

I'm working with some big text fields in columns. After some cleanup I have something like below:

truth_val: ["5"]
xerb Scale: ["2"]
perb Scale: ["1"]

I want to extract the number 2. I'm trying to match the string "xerb Scale" and then extract 2. I tried capturing the group including 2 as (?:xerb Scale:\s\[\")\d{1} and tried to exclude the matched group through a negative look ahead but had no luck.

This is going to be in a SQL query and I'm trying to extract the numerical value through a REGEXP_EXTRACT() function. This query is part of a pipeline that loads this information into the database.

Any help would be much appreciated!

Wiktor Stribiżew · Accepted Answer

You should match what you do not need to obtain in order to set the context for your match, and you need to match and capture what you need to extract:

xerb Scale:\s*\["(\d+)"]
                 ^^^^^

See the regex demo. In Presto, use REGEXP_EXTRACT to get the first match:

SELECT regexp_extract(col, 'xerb Scale:\s*\["(\d+)"]', 1); -- 2
                                                      ^^^

Note the 1 argument:

regexp_extract(string, pattern, group) → varchar
Finds the first occurrence of the regular expression pattern in string and returns the capturing group number group

Extracting a numerical value from a paragraph based on preceding words

Answers (1)

Related Questions