Reputation: 517
I'm currently trying out different kinds of formulas for REGEX EXTRACT, trying to play around and understand it fully. Below will be an example of the data that I'm using and the current code that I'm using to grab what I need. (Please critique my code if it can be written better as I'm still learning REGEX EXTRACT)
Sample_Data
AAAA;BBBB;CCCC;A1=1234;DDDD;EEEE
FFFF;GGGG;A1=2345;A2=4567,2345;RRRR;KKKK
SSSS;TTTT;UUUU;VVVV;A1=3456;GGGG;UUUU
UUUU:WWWW;QQQQ;IIII;A1=9876;A2=7654,7890;UUUU
The current code that I have is:
SELECT
REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A1=(\d*)') AS A1,
REGEXP_EXTRACT(Sample_Date, r'(?:^|;)A2=(\d*)(?:;)') AS A2,
SPLIT(REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A2=(\d*\,\d*)(?:;)'), ",")[offset(1)] AS A2_v1
FROM
db.Sample
The output that I get is:
A1 | A2 | A2_v1
1234 | NULL | NULL
2345 | 4567 | 2345
3456 | NULL | NULL
9876 | 7654 | 7890
With the output it's what I would expect. But, there are a few different questions I have from this, as you can see in the output row 2:
2345 | 4567 | 2345
It has 2345
twice, is there a way to make it so it only shows 2345
once so something like:
2345 | 4567 | NULL
My thought process is to have a CASE WHEN
and have it check the REGEXP_EXTRACT
formulas to see if they match and if they do throw a NULL
instead. Is there a better way of doing this or would this be the best result?
My second question is, lets say we have the following sample data:
AAAA;GGGG;DDDD;A1=1234;A2=7890,1234,3456;DDDD
BBBB;DDDD;CCCC;FFFF;A1=2345;A2=8907,1234,4567,8976;WWWW;GGGG
CCCC;EEEE;A1=6789;A2=34567,8901,3456,12345;TTTT
With the current formulas that I have, it would work to get A1
and a part of A2
only. But, how would I convert the formula to be able to pick up all digits separated by ,
? The end result that I'm looking for is the following:
A1 | A2 | A2_v1 | A2_v2 | A2_v3
1234 | 7890 | 1234 | 3456 | NULL
2345 | 8907 | 1234 | 4567 | 8976
6789 | 34567| 8901 | 3456 | 12345
How would I make this work properly? Would it be a variation of the:
SPLIT(REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A2=(\d*\,\d*)(?:;)'), ",")[offset(1)] AS A2_v1
And have a different offset
? OR is there a different kind of formula that would be capable of doing this?
Any help would be much appreciated!!
Upvotes: 0
Views: 148
Reputation: 2850
To avoid repeating the numbers I think that your idea of CASE ... WHEN
is a good approach. In this case, the IF
conditional can be used as a shorthand. By making the original query a subquery is easier to compare the values.
For A2, in REGEXP_EXTRACT you cannot use more than one match group, so the full digits can be captured by being more permissive in the regex. For example, the regex used:
'A2=([\d,]*)'
Will also match expressions like: A2=1,2,3,4,5
which may or may not should be allowed in your scenario. The regex can be improved to match exactly what you're looking for; however, it will need to be much longer or, it will need to use more than matching group. Example:
'A2=((\d{4},?)+)'
This regex will match one or more sequences of four numbers followed by zero or one commas. To use this regex you can use REGEXP_REPLACE
instead, and keep the desired part while removing everything else. However, this approach seems to complicate things more than simplify them.
Finally, since the number of values in the array may change, I suggest using SAFE_OFFSET
to access the array values, as this will return null values whenever there's an Index Out of Range Error.
You can use the below SQL query as a reference:
SELECT
A1,
IF(A2[SAFE_OFFSET(0)] = A1, NULL, A2[SAFE_OFFSET(0)]) AS A2,
IF(A2[SAFE_OFFSET(1)] = A1, NULL, A2[SAFE_OFFSET(1)]) AS A2_V1,
IF(A2[SAFE_OFFSET(2)] = A1, NULL, A2[SAFE_OFFSET(2)]) AS A2_v2,
IF(A2[SAFE_OFFSET(3)] = A1, NULL, A2[SAFE_OFFSET(3)]) AS A2_v3
FROM (
SELECT
REGEXP_EXTRACT(Sample_Data, r'A1=(\d{4})') as A1,
SPLIT(REGEXP_EXTRACT(Sample_Data, r'A2=([\d,]*)'), ",") AS A2
FROM
(
SELECT 'BBBB;DDDD;CCCC;FFFF;A1=2345;A2=8907,1234,4567,2345;WWWW;GGGG' AS Sample_Data
UNION ALL
SELECT 'CCCC;EEEE;A1=6789;TTTT' AS Sample_Data)
)
Upvotes: 1