activelearner
activelearner

Reputation: 7805

BigQuery - Remove specific duplicate records

I have a BigQuery table containing data as below:

date  hits_eventInfo_Category  hits_eventInfo_Action  session_id  user_id  hits_time  hits_eventInfo_Label

20151021  Air  Search  1445001  A232  1952  CurrentLocation
20151021  Air  Search  1445001  A232  1952  CurrentLocation
20151021  Air  Search  1445001  A232  1952  CurrentLocation
20151021  Air  Select  1445001  A232  7380  Vendor
20151021  Air  Select  1445001  A232  7380  Vendor
20151021  Air  Select  1445001  A232  7380  Vendor

As you can see there are a series of duplicate records. I would like to end up with one of the duplicate records from each of the duplicate record sets. For example:

date  hits_eventInfo_Category  hits_eventInfo_Action  session_id  user_id  hits_time  hits_eventInfo_Label

20151021  Air  Search  1445001  A232  1952  CurrentLocation
20151021  Air  Select  1445001  A232  7380  Vendor

How would I do this?

Thanks in advance!

Upvotes: 1

Views: 724

Answers (2)

EMUEVIL
EMUEVIL

Reputation: 512

You could either use a DISTINCT clause, or group your data. These will aggregate the data returned into a single line for each unique entry.

SELECT DISTINCT [date], [hits_eventInfo_Category], [hits_eventInfo_Action], [session_id], [user_id], [hits_time], [hits_eventInfo_Label]
FROM [BigQuery]

--OR

SELECT [date], [hits_eventInfo_Category], [hits_eventInfo_Action], [session_id], [user_id], [hits_time], [hits_eventInfo_Label]
FROM [BigQuery]
GROUP BY [date], [hits_eventInfo_Category], [hits_eventInfo_Action], [session_id], [user_id], [hits_time], [hits_eventInfo_Label]

Note: This won't delete your duplicate data, it just won't be displayed in the results of your select statement. If you would like the duplicate entries deleted permenantely, use @singhsac 's response utilizing window functions.

Upvotes: 3

singhsac
singhsac

Reputation: 401

You can group the duplicates. Keep one row, and delete the remaining from the duplicate group:

Try this (I've assumed table name and other fields)

;WITH rmvDuplicate
        AS (SELECT  ROW_NUMBER() OVER (PARTITION BY [date],  [hits_eventInfo_Category],  [hits_eventInfo_Action],  [session_id],  [user_id],  [hits_time],  [hits_eventInfo_Label]
                                        ORDER BY (SELECT 0)) dup
            FROM    BigQuery_table)
    DELETE FROM rmvDuplicate
    WHERE   dup > 1

Upvotes: 1

Related Questions