BigQuery : how to build a new row based on duplicates

Question

In BigQuery, I got this type of data below :

#standardSQL
WITH name_table AS (
  SELECT 'a' id, 1 hitnumber, 'alpha' page UNION ALL
  SELECT 'a', 2, 'beta' UNION ALL
  SELECT 'a', 3, 'beta' UNION ALL
  SELECT 'a', 4, 'alpha' UNION ALL 
  SELECT 'a', 5, 'beta' UNION ALL
  SELECT 'b', 1, 'gamma' UNION ALL
  SELECT 'b', 2, 'gamma'
 )
SELECT *
FROM name_table

 Output :
    id| hitnumber|  page 
    a |   1      |  alpha
    a |   2      |  beta 
    a |   3      |  beta 
    a |   4      |  alpha
    a |   5      |  beta
    b |   1      |  gamma
    b |   2      |  gamma

And I want to build a new hitnumber based on duplicate pages as below.

id| hitnumber|  page |  new_hitnumber
a |   1      |  alpha|  1
a |   2      |  beta |  2
a |   3      |  beta |  2
a |   4      |  alpha|  3
a |   5      |  beta |  4
b |   1      |  gamma|  1
b |   2      |  gamma|  1

Or if it's possible to deduplicate directly and obtain :

id| page |  new_hitnumber
a | alpha|  1
a | beta |  2
a | alpha|  3
a | beta |  4
b | gamma|  1

I tried with ROW_NUMBER() or RANK() but without success.

Many thanks in advance for your help.

Arnaud

Mikhail Berlyant · Accepted Answer

want to build a new hitnumber based on duplicate pages as below.

Use below approach

select * except(new_group),
  countif(ifnull(new_group, true)) over(partition by id order by hitnumber) new_hitnumber
from (
  select *, 
    page != lag(page) over(partition by id order by hitnumber) new_group
  from name_table
)
# order by id, hitnumber

if applied to sample data in your question - output is

Or if it's possible to deduplicate directly

select * except(hitnumber) from (
  select * except(new_group),
    countif(ifnull(new_group, true)) over(partition by id order by hitnumber) new_hitnumber
  from (
    select *, 
      page != lag(page) over(partition by id order by hitnumber) new_group
    from name_table
  )
)
where true 
qualify row_number() over(partition by id, page, new_hitnumber order by hitnumber) = 1
order by id, new_hitnumber

in this case - output is

BigQuery : how to build a new row based on duplicates

Answers (1)

Related Questions