Gross
Gross

Reputation: 165

Google BiqQuery: How delete dublicate rows from table without any extra colum

I have table with two columns, where there are duplicates:

select record_id,
       session_id,
       count(record_id) as total_duplicate_records
from `project.dataset.table` 
group by 1,2
order by 3 desc

The output looks like: enter image description here

I saw a similar question: BigQuery Standard SQL: Delete Duplicates from Table but an answer based on some extra column to be used to serve as a tie-breaker. In my case there is no such column...

The goal is to run a scheduled query that will delete all duplicate records once a day

Upvotes: 0

Views: 206

Answers (3)

jeroen
jeroen

Reputation: 84

You can use SELECT DISTINCT to deduplicate your table.

CREATE OR REPLACE TABLE
    `project.dataset.table` -- this will overwrite your table
AS

SELECT DISTINCT
    record_id,
    session_id
FROM
    `project.dataset.table`

Create a new table

Replace the table to be replaced or created with a new name. This will create a new table. Running the query again will recreate/overwrite the new table.

CREATE OR REPLACE TABLE
    `project.dataset.table_deduped` -- change to a new table
AS

SELECT DISTINCT
    record_id,
    session_id
FROM
    `project.dataset.table`

Some potential drawbacks:

  • you are storing your data twice. (The duplicated and deduplicated tables)
  • You have to recreate the table each time you get new data in your original table, however you can instead of recreating it, insert new data via a merge statement

Create a VIEW

A view is a virtual table based on a query. The data is not stored in a new table. You query it the same way you query a table. Can be useful if the underlying table or tables gets updated and you always want an up to date 'view' of the data.

CREATE VIEW
    `project.dataset.table_deduped` -- change to a new table
AS

SELECT DISTINCT
    record_id,
    session_id
FROM
    `project.dataset.table`

Use a cte table (WITH clause)

Create a temporary table that you can reference inside the same query.

WITH 
    deduped_table AS (
        SELECT DISTINCT
            record_id,
            session_id
        FROM
            `project.dataset.table`
)

SELECT 
    *
FROM 
    deduped_table

All the options have their pro's and cons. Depending on what you want to achieve, how your data is structured and how much you need to care about the amount of data your are querying/storing. If you only have the two columns it's probably not that big of a deal but worth noting your options in any case I rate.

Upvotes: 1

Prajna Rai T
Prajna Rai T

Reputation: 1820

Input Table:

enter image description here

If all the rows having all fields as duplicate, you can consider below approach

select as value any_value(t)
from `project.dataset.table1` t
group by format('%t', t)

Output:

enter image description here

To use it with DDL, you can consider below approach:

create or replace table `project.dataset.table2` as
select * from (
 select as value any_value(t)
 from `project.dataset.table1` t
 group by format('%t', t)
);

Output Table:

enter image description here

Upvotes: 1

C.Georgiadis
C.Georgiadis

Reputation: 219

You could use something like this. It will replace the existing table with one row per unique combination of record_id and session_id

CREATE OR REPLACE TABLE
  `project.dataset.table`
AS
SELECT
  record_id, session_id
FROM
  `project.dataset.table`
GROUP BY 1,2

Upvotes: 1

Related Questions