xiaochuan fang
xiaochuan fang

Reputation: 114

BigQuery: delete duplicated row that are not fully duplicated (delete desire row)

I have a table recording customer step on daily basis. The table had Id, date and step column. Some rows contained different steps on the same day for the same Id. Sample as shown below on 5/3/2020 and 5/4/2020 for Id 1: | Id | Date | Step | |:-----|:---------|:-----| | 1 | 5/1/2020 | 1 | | 1 | 5/2/2020 | 1 | | 1 | 5/3/2020 | 0 | | 1 | 5/3/2020 | 5 | | 1 | 5/4/2020 | 2 | | 1 | 5/4/2020 | 10 | | 1 | 5/5/2020 | 1 | | 2 | 5/1/2020 | 1 | | 2 | 5/2/2020 | 2 | | 2 | 5/3/2020 | 0 |

I want to delete rows that contain lesser step, which is 5/3/2020 for 0 step, 5/4/2020 for 2 step for Id 1.

I had tried using row_number() like this:

SELECT
  Id,
  Date,
  step,
  ROW_NUMBER() OVER (PARTITION BY Id, Date ORDER BY Id, Date) AS rn
FROM
  `dataset.step`
WHERE rn>1

But that will give me rows with higher step, which is not want I want. I also able to select rows with fewer step like this:

SELECT * FROM
  `dataset.step` AS A
  INNER JOIN
  `dataset.step` AS B
  ON A.Id = B.Id
  AND A.Date = B.Date
  WHERE A.step < B.step 

But find no way to use it for delete.

Upvotes: 0

Views: 76

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173210

Use below approach

select *
from your_table
qualify 1 = row_number() over win
window win as (partition by id, date order by step desc)    

if applied to sample data in your question - output is

enter image description here

Upvotes: 2

Related Questions