Maykid
Maykid

Reputation: 517

How to use regex extract to grab text between specific characters?

I'm currently trying out different kinds of formulas for REGEX EXTRACT, trying to play around and understand it fully. Below will be an example of the data that I'm using and the current code that I'm using to grab what I need. (Please critique my code if it can be written better as I'm still learning REGEX EXTRACT)

Sample_Data
AAAA;BBBB;CCCC;A1=1234;DDDD;EEEE
FFFF;GGGG;A1=2345;A2=4567,2345;RRRR;KKKK
SSSS;TTTT;UUUU;VVVV;A1=3456;GGGG;UUUU
UUUU:WWWW;QQQQ;IIII;A1=9876;A2=7654,7890;UUUU

The current code that I have is:

SELECT
 REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A1=(\d*)') AS A1,
 REGEXP_EXTRACT(Sample_Date, r'(?:^|;)A2=(\d*)(?:;)') AS A2,
 SPLIT(REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A2=(\d*\,\d*)(?:;)'), ",")[offset(1)] AS A2_v1
FROM
 db.Sample

The output that I get is:

  A1    |   A2    |  A2_v1
1234    |  NULL   |  NULL
2345    |  4567   |  2345
3456    |  NULL   |  NULL
9876    |  7654   |  7890

With the output it's what I would expect. But, there are a few different questions I have from this, as you can see in the output row 2:

2345  |  4567  |  2345

It has 2345 twice, is there a way to make it so it only shows 2345 once so something like:

2345  |  4567  |  NULL

My thought process is to have a CASE WHEN and have it check the REGEXP_EXTRACT formulas to see if they match and if they do throw a NULL instead. Is there a better way of doing this or would this be the best result?

My second question is, lets say we have the following sample data:

AAAA;GGGG;DDDD;A1=1234;A2=7890,1234,3456;DDDD
BBBB;DDDD;CCCC;FFFF;A1=2345;A2=8907,1234,4567,8976;WWWW;GGGG
CCCC;EEEE;A1=6789;A2=34567,8901,3456,12345;TTTT

With the current formulas that I have, it would work to get A1 and a part of A2 only. But, how would I convert the formula to be able to pick up all digits separated by ,? The end result that I'm looking for is the following:

  A1  |  A2  |  A2_v1 | A2_v2 | A2_v3
 1234 | 7890 |  1234  | 3456  | NULL
 2345 | 8907 |  1234  | 4567  | 8976
 6789 | 34567|  8901  | 3456  | 12345

How would I make this work properly? Would it be a variation of the:

SPLIT(REGEXP_EXTRACT(Sample_Data, r'(?:^|;)A2=(\d*\,\d*)(?:;)'), ",")[offset(1)] AS A2_v1

And have a different offset? OR is there a different kind of formula that would be capable of doing this?

Any help would be much appreciated!!

Upvotes: 0

Views: 148

Answers (1)

Tlaquetzal
Tlaquetzal

Reputation: 2850

To avoid repeating the numbers I think that your idea of CASE ... WHEN is a good approach. In this case, the IF conditional can be used as a shorthand. By making the original query a subquery is easier to compare the values.

For A2, in REGEXP_EXTRACT you cannot use more than one match group, so the full digits can be captured by being more permissive in the regex. For example, the regex used:

'A2=([\d,]*)'

Will also match expressions like: A2=1,2,3,4,5 which may or may not should be allowed in your scenario. The regex can be improved to match exactly what you're looking for; however, it will need to be much longer or, it will need to use more than matching group. Example:

'A2=((\d{4},?)+)'

This regex will match one or more sequences of four numbers followed by zero or one commas. To use this regex you can use REGEXP_REPLACE instead, and keep the desired part while removing everything else. However, this approach seems to complicate things more than simplify them.

Finally, since the number of values in the array may change, I suggest using SAFE_OFFSET to access the array values, as this will return null values whenever there's an Index Out of Range Error.

You can use the below SQL query as a reference:

SELECT
  A1,
  IF(A2[SAFE_OFFSET(0)] = A1, NULL, A2[SAFE_OFFSET(0)]) AS A2,
  IF(A2[SAFE_OFFSET(1)] = A1, NULL, A2[SAFE_OFFSET(1)]) AS A2_V1,
  IF(A2[SAFE_OFFSET(2)] = A1, NULL, A2[SAFE_OFFSET(2)]) AS A2_v2,
  IF(A2[SAFE_OFFSET(3)] = A1, NULL, A2[SAFE_OFFSET(3)]) AS A2_v3
FROM (
  SELECT
    REGEXP_EXTRACT(Sample_Data, r'A1=(\d{4})') as A1,
    SPLIT(REGEXP_EXTRACT(Sample_Data, r'A2=([\d,]*)'), ",") AS A2
  FROM
   (
    SELECT 'BBBB;DDDD;CCCC;FFFF;A1=2345;A2=8907,1234,4567,2345;WWWW;GGGG' AS Sample_Data
    UNION ALL
    SELECT 'CCCC;EEEE;A1=6789;TTTT' AS Sample_Data)
)

Upvotes: 1

Related Questions