James Harrington
James Harrington

Reputation: 103

Regex that matches strings with specific text not between text in BigQuery

I have the following strings:

step_1->step_2->step_3
step_1->step_3
step_1->step_2->step_1->step_3
step_1->step_2->step_1->step_2->step_3

What I would like to do is to capture the ones that between step_1 and step 3 there's no step_2.

The results should be like this:

string                                   result
step_1->step_2->step_3                    false
step_1->step_3                             true
step_1->step_2->step_1->step_3             true
step_1->step_2->step_1->step_2->step_3    false

I have tried to use the negative lookahead but I found out that BigQuery doesn't support it. Any ideas?

Upvotes: 0

Views: 1333

Answers (3)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173210

Consider also below option

select string, 
  not regexp_contains(string, r'step_1->(step_2->)+step_3\b') as result
from your_table

Upvotes: 1

Cylldby
Cylldby

Reputation: 1978

I believe @Daniel_Zagales answer is the one you were expecting. However here is a broader solution that can maybe be interesting in your usecase:it consists in using arrays

 WITH sample AS (
  SELECT 'step_1->step_2->step_3' AS path
  UNION ALL SELECT 'step_1->step_3'
  UNION ALL SELECT 'step_1->step_2->step_1->step_3'
  UNION ALL SELECT 'step_1->step_2->step_1->step_2->step_3'
),
  temp AS (
  SELECT
    path,
    SPLIT(REGEXP_REPLACE(path,'step_', ''), '->') AS sequences
  FROM
    sample)
SELECT
  path,
  position,
  flattened AS current_step,
  LAG(flattened) OVER (PARTITION BY path ORDER BY OFFSET ) AS previous_step,
  LEAD(flattened) OVER (PARTITION BY path ORDER BY OFFSET ) AS following_step
FROM
  temp,
  temp.sequences AS flattened
WITH
OFFSET AS position
 

This query returns the following table

enter image description here

The concept is to get an array of the step number (splitting on '->' and erasing 'step_') and to keep the OFFSET (crucial as UNNESTing arrays does not guarantee keeping the order of an array).

The table obtained contains for each path and step of said path, the previous and following step. It is therefore easy to test for instance if successive steps have a difference of 1.

(SELECT * FROM <previous> WHERE ABS(current_step-previous_step) != 1 for example)

(CASTing to INT required)

Upvotes: 0

Daniel Zagales
Daniel Zagales

Reputation: 3032

You are essentially looking for when the pattern does not exist. The following regex would support that embedded in a case statement. This would not support a scenario where you have both conditions in a single string, however that was not a scenario you listed in your sample data.

Try the following:

with sample_data as (
    select 'step_1->step_2->step_3' as string union all
    select 'step_1->step_3' union all
    select 'step_1->step_2->step_1->step_3'  union all
    select 'step_1->step_2->step_1->step_2->step_3' union all 
    select 'step_1->step_2->step_1->step_2->step_2->step_3' union all 
    select 'step_1->step_2->step_1->step_2->step_2' 
)

select 
    string,
    -- CASE WHEN regexp_extract(string, r'step_1->(\w+)->step_3') IS NULL THEN TRUE
    CASE WHEN regexp_extract(string, r'1(->step_2)+->step_3') IS NULL THEN TRUE
    ELSE FALSE END as result
from sample_data 

This results in:

enter image description here

Upvotes: 1

Related Questions