Afanasii Kurakin
Afanasii Kurakin

Reputation: 3459

Discard rows with non-consecutive numbers in Google BigQuery Standard SQL

I have dataset with field containing repeating numbers sequence (1 - 5 for brevity), and some rows with the number out of sequence:

1
2
3
5 -- out of sequence, need to discard
4
5
1 -- sequence starts over
2
3
4
...

How do I discard rows that are out of sequence? Thanks!

UPDATE: There is other column to specify the ordering.

UPDATE2: Dataset for testing:

WITH t AS (
  SELECT * FROM
  UNNEST([
    STRUCT(1 AS id, 1 AS n),
    (2, 2),
    (3, 3),
    (4, 5),
    (5, 4),
    (6, 5),
    (7, 1),
    (8, 2),
    (9, 3),
    (10, 4)
  ])
)

UPDATE3: There can be many numbers out of order (from the same 1 - 5 range). The sequence always starts with 1, and all numbers are present, except the last round, which can be incomplete (ending earlier, see the test set). The out-of-order numbers are like a "noise" that needs to be removed.

Upvotes: 1

Views: 226

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269883

Let me assume that you have another column to specify the ordering. If not, the problem is not well-defined.

If so, you can use lag():

select t.*
from (select t.*, lag(col) over (order by row) as prev_col
      from t
     ) t
where (prev_col is null) or 
      (prev_col = col - 1) or
      (prev_col = 5 and col = 1);

The ? is for the column that specifies the ordering.

Upvotes: 3

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172993

Below is for BigQuery Standard SQL and with use of JS UDF
Tt returns all found sequences started with 1 and with following consecutive numbers

#standardSQL
CREATE TEMPORARY FUNCTION extract_sequence(arr ARRAY<STRUCT<id INT64, n INT64>>) 
RETURNS ARRAY<STRUCT<id INT64, n INT64>>
LANGUAGE js AS """
  target = [1,2,3,4,5];
  var result = [];
  j = 0;
  for (i = 0; i < arr.length; i++) { 
    if (arr[i].n == target[j]) {
      x = [];
      x.id = arr[i].id;
      x.n = arr[i].n;
      result.push(x);
      j++
    } 
  }
  return result;
""";
WITH t AS (
  SELECT * 
  FROM UNNEST([
    STRUCT(1 AS id, 1 AS n), (2, 2), (3, 3), (4, 5), (5, 4), (6, 5), (7, 1), (8, 2), (9, 3), (10, 4)
  ])
)
SELECT elem.id, elem.n, grp
FROM (
  SELECT grp, extract_sequence(ARRAY_AGG(STRUCT(id, n) ORDER BY id)) arr
  FROM (
    SELECT id, n, COUNTIF(n = 1) OVER(ORDER BY id) grp
    FROM t
  )
  GROUP BY grp
), UNNEST(arr) elem
ORDER BY id  

with result as expected:

Row id  n   grp  
1   1   1   1    
2   2   2   1    
3   3   3   1    
4   5   4   1    
5   6   5   1    
6   7   1   2    
7   8   2   2    
8   9   3   2    
9   10  4   2      

Hope you will adjust to you specific case

Upvotes: 2

Related Questions