sqlregexstringsnowflake-cloud-data-platform

Reputation: 11

Extracting elements of the array in snowflake using regular expression or other functions

I have the following array from which I want to extract all the values from the elements that start with M2 and sum up only the values of BITC. Example:

[
  "M1 USD 3399.43 BITC 3990.50 MAD 0.00 LOSS -591.07",
  "M2 USD 3144.96 BITC **3399.43** MAD 0.00 LOSS -254.47",
  "M3 USD 1131.78 BITC 3144.96 MAD 0.00 LOSS -2013.18",
  "M1 USD 7.91 BITC 3.92 MAD 0.00 LOSS 3.99",
  "M2 USD 1.47 BITC **7.91** MAD 0.00 LOSS -6.44",
  "M3 USD 2.85 BITC 1.47 MAD 0.00 LOSS 1.38",
  "M1 USD 0.30 BITC 0.30 MAD 0.00 LOSS 0.00",
  "M2 USD 158.21 BITC** 0.30** MAD 0.00 LOSS 157.91",
  "M3 USD 204.17 BITC 158.21 MAD 0.00 LOSS 45.96",
  "M1 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00",
  "M2 USD 0.63 BITC **0.63 ** MAD 0.00 LOSS 0.00",
  "M3 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00"
]

Need to get the sum of 3399.43+7.90+0.30+0.63

Flexibility would be helpful with some elements in other records not clean. example it can be M2USD instead of M2 USD. Starting with M2 would be helpful

I was trying to use Snowflake SQL to extract all elements and later on flattening them and adding them: but even the first step of extracting does not work

regexp_substr_all(ARRAY_TO_STRING(X, ','), 'M2.*?BITC\\s?(-?\\d+\\.?\\d+)',1,1,'e') as BITC_M2,

results in 
[
  "0.63"
]

Upvotes: 1

Answers (3)

gzzz

Reputation: 368

If all entries are in the strict format 'M2 USD … BITC … MAD … LOSS …', where the BITC-value is always the fifth part, you can try using a simple split:

SELECT SUM(SPLIT_PART(value, ' ', 5)::float) AS bitc_sum
FROM TABLE(FLATTEN(INPUT => data))

Where data is your array with the values from the question.

Upvotes: 0

Simeon Pilgrim

Reputation: 26078

select $1 as inp, 
    regexp_substr($1, 'M2.*BITC\\s+(\\d+\\.\\d+)\\s',1,1,'e',1) as m2
from values
 ('M1 USD 3399.43 BITC 3990.50 MAD 0.00 LOSS -591.07'),
 ('M2 USD 3144.96 BITC 3399.43 MAD 0.00 LOSS -254.47'),
 ('M3 USD 1131.78 BITC 3144.96 MAD 0.00 LOSS -2013.18'),
 ('M1 USD 7.91 BITC 3.92 MAD 0.00 LOSS 3.99'),
 ('M2 USD 1.47 BITC 7.91 MAD 0.00 LOSS -6.44'),
 ('M3 USD 2.85 BITC 1.47 MAD 0.00 LOSS 1.38'),
 ('M1 USD 0.30 BITC 0.30 MAD 0.00 LOSS 0.00'),
 ('M2 USD 158.21 BITC 0.30 MAD 0.00 LOSS 157.91'),
 ('M3 USD 204.17 BITC 158.21 MAD 0.00 LOSS 45.96'),
 ('M1 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00'),
 ('M2 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00'),
 ('M3 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00');

gives:

so that regex works.

now TRANSFORM ARRAY_COMPACT, & REDUCE it.

select parse_json($1) as inp, 
    transform(inp, s text -> regexp_substr(s, 'M2.*BITC\\s+(\\d+\\.\\d+)\\s',1,1,'e',1)) as t,
    array_compact(t) as c,
    reduce(c, 
        0,
       (acc, val) -> acc + val::number(30,2)) as r
from values
 ('[
  "M1 USD 3399.43 BITC 3990.50 MAD 0.00 LOSS -591.07",
  "M2 USD 3144.96 BITC 3399.43 MAD 0.00 LOSS -254.47",
  "M3 USD 1131.78 BITC 3144.96 MAD 0.00 LOSS -2013.18",
  "M1 USD 7.91 BITC 3.92 MAD 0.00 LOSS 3.99",
  "M2 USD 1.47 BITC 7.91 MAD 0.00 LOSS -6.44",
  "M3 USD 2.85 BITC 1.47 MAD 0.00 LOSS 1.38",
  "M1 USD 0.30 BITC 0.30 MAD 0.00 LOSS 0.00",
  "M2 USD 158.21 BITC 0.30 MAD 0.00 LOSS 157.91",
  "M3 USD 204.17 BITC 158.21 MAD 0.00 LOSS 45.96",
  "M1 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00",
  "M2 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00",
  "M3 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00"]');

thus it can be compacted to:

select 
    reduce(array_compact(transform(parse_json($1), s text -> regexp_substr(s, 'M2.*BITC\\s+(\\d+\\.\\d+)\\s',1,1,'e',1))), 
        0,
       (acc, val) -> acc + val::number(30,2)) as r
from values
 ('[
  "M1 USD 3399.43 BITC 3990.50 MAD 0.00 LOSS -591.07",
  "M2 USD 3144.96 BITC 3399.43 MAD 0.00 LOSS -254.47",
  "M3 USD 1131.78 BITC 3144.96 MAD 0.00 LOSS -2013.18",
  "M1 USD 7.91 BITC 3.92 MAD 0.00 LOSS 3.99",
  "M2 USD 1.47 BITC 7.91 MAD 0.00 LOSS -6.44",
  "M3 USD 2.85 BITC 1.47 MAD 0.00 LOSS 1.38",
  "M1 USD 0.30 BITC 0.30 MAD 0.00 LOSS 0.00",
  "M2 USD 158.21 BITC 0.30 MAD 0.00 LOSS 157.91",
  "M3 USD 204.17 BITC 158.21 MAD 0.00 LOSS 45.96",
  "M1 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00",
  "M2 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00",
  "M3 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00"]');

R
3408.27

Striping the new-lines out:

select 
    reduce(array_compact(transform(parse_json($1), s text -> regexp_substr(s, 'M2.*BITC\\s+(\\d+\\.\\d+)\\s',1,1,'e',1))), 
        0,
       (acc, val) -> acc + val::number(30,2)) as r
from values
 ('[ "M1 USD 3399.43 BITC 3990.50 MAD 0.00 LOSS -591.07", "M2 USD 3144.96 BITC 3399.43 MAD 0.00 LOSS -254.47", "M3 USD 1131.78 BITC 3144.96 MAD 0.00 LOSS -2013.18", "M1 USD 7.91 BITC 3.92 MAD 0.00 LOSS 3.99", "M2 USD 1.47 BITC 7.91 MAD 0.00 LOSS -6.44", "M3 USD 2.85 BITC 1.47 MAD 0.00 LOSS 1.38",  "M1 USD 0.30 BITC 0.30 MAD 0.00 LOSS 0.00",  "M2 USD 158.21 BITC 0.30 MAD 0.00 LOSS 157.91", "M3 USD 204.17 BITC 158.21 MAD 0.00 LOSS 45.96",  "M1 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00",
  "M2 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00", "M3 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00"]');

works just the same

Upvotes: 1

rich neadle

Reputation: 383

This regex will help you find the numbers 3399.43+7.91(NOT 7.90)+0.30+0.63 that you are looking for in the string, extra spaces ? and the literal stars \* are also accounted for. Will also capture float or int for extra flexibility.

ISSUE: Greediness* With the original string without newlines characters, there was an issue with the .* being able to match everything in the string without having to stop at the newline. The reason you were getting only one and a wrong match, 0.63, the 0.63 after the last M3 after the last BITC instead of the four correct matches, is because the .* is greedy. The * quantifier will try to match as many characters as possible while still making a match. In your case, it will match the first M2 and then greedily go to the last BITC (in this case the BITC following M3) and make one match. The dot special character . will match any character except a newline character \n.

SOLUTION: Laziness. One solution is to make the 'greedy dot' .* lazy by adding a question mark behind the star quantifier like this .*?. Now, that the dot is lazy, it will do as little work possible, the minimum needed, to make the match. So, in this case, it will attempt the first match using the first BITC following the first M2 that it encounters. There is still a danger here that if it will not get a match with the first BITC it will to try to make a match with the next BITC.

Regex Pattern (Java flavor compatible with SQL):
Edit 2025-02-22: Corrected Regex - typo (Java 8 Regex flavor)

M2[^\"\n$]*?BITC(?: ?\*\*)? ?(-?\d+(?:\.\d+)?)(?: |\*)

* Global (g) flag (= 'Don't return after first match') enabled for multiple matches (This comment was added with edit).

TEST STRING (The original test string without newline characters):

[ "M1 USD 3399.43 BITC 3990.50 MAD 0.00 LOSS -591.07", "M2 USD 3144.96 BITC 3399.43 MAD 0.00 LOSS -254.47", "M3 USD 1131.78 BITC 3144.96 MAD 0.00 LOSS -2013.18", "M1 USD 7.91 BITC 3.92 MAD 0.00 LOSS 3.99", "M2 USD 1.47 BITC 7.91 MAD 0.00 LOSS -6.44", "M3 USD 2.85 BITC 1.47 MAD 0.00 LOSS 1.38", "M1 USD 0.30 BITC 0.30 MAD 0.00 LOSS 0.00", "M2 USD 158.21 BITC** 0.30** MAD 0.00 LOSS 157.91", "M3 USD 204.17 BITC 158.21 MAD 0.00 LOSS 45.96", "M1 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00", "M2 USD 0.63 BITC **0.63 ** MAD 0.00 LOSS 0.00", "M3 USD 0.63 BITC 0.63 MAD 0.00 LOSS 0.00

RESULT (Matches 1, 2, 3 and 4. Group 1 value of each match)

Regex Demo: https://regex101.com/r/GuTYdw/11

NOTES:

M2 Literal 'M2'
[^\"\n$]*? Negated class [^...] matches any characters that are not the ones inside the negated character class. Match anything that is not a literal double quote \", a newline \n, or end of string $. I recommend using this so that if something is off with the BITC value, you will not accidentally capture the value from another BITC not associated with the M2. This is also lazy, makes the regex engine work less. Lazy will capture the minimum necessary number of characters to make match. In this case, it is not necessary, but it will make the regex more efficient (less steps).
(?: ?\*\* )? Optional (..)? non-capturing group ?:...) that matches 0 or 1 space character ? followed by two literal star character \*\*. You can use a whitespace character \s which will also match a newline, if that works for you.
? Zero or one space characters.
(...) Capturing group
(-?\d+(?:\.\d+)?) Capture (...) the match (group) that will match an optional dash followed by 1 or more + digits \d, followed by an optional non-capturing group (?:...)? that matches a literal dot \. followed by 1 or more + digits \d
(?: |\*) End this match with non-capturing (?:...) alternation (...|...). Match a space or a literal star * character. (This is an extra check,to makes sure that we have the correct number, a float or int number with the last number of the match that is followed by a space or a star.)

Upvotes: 0

Extracting elements of the array in snowflake using regular expression or other functions

Answers (3)

Related Questions