swm
swm

Reputation: 539

Store only 1 values and remove the rest for same duplicated values in bigquery

I have duplicated values in my data. However, from the duplicated values, i only want to store 1 values and remove the rest of same duplicated values. So far, I have found the solution where they remove ALL the duplicated values like this.

posts

Code:

  SELECT ID, a.date as date.A, b.date as date.B, 
       CASE WHEN a.date <> b.date THEN NULL END AS b.date
except(date.A)

FROM
    table1 a LEFT JOIN table2 b
 USING (ID)
WHERE date.A = 1

Sample input:

enter image description here

Sample output (Store only 1 values from the duplicated values and remove the rest):

![enter image description here

NOTE: query might wrong as it remove all duplicated values.

Upvotes: 0

Views: 379

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269773

You can use aggregation. Something like this:

SELECT ANY_VALUE(a).*, ANY_VALUE(b).*
FROM table1 a LEFT JOIN
     table2 b
     USING (ID)
WHERE date.A = 1
GROUP BY id, a.date;

For each id/datecombination, this returns an arbitrary matching row froma/b`.

Upvotes: 1

Alexandre Moraes
Alexandre Moraes

Reputation: 4042

Considering your screenshot's sample data and your explanation. I understand that you want to remove duplicates from your table retaining only one row of unique data. Thus, I was able to create a query to select only one row of data ignoring the duplicates.

In order to select the rows without any duplicates, you can use SELECT DISTINCT. According to the documentation, it discards any duplicate rows. In addition to this method, CREATE TABLE statement will also be used to create a new table (or replace the previous one) with the new data without duplicates. The syntax is as follows:

CREATE OR REPLACE TABLE project_id.dataset.table AS
SELECT DISTINCT ID, a.date as date.A, b.date as date.B, 
       CASE WHEN a.date <> b.date THEN NULL END AS b.date
except(date.A)

FROM
    table1 a LEFT JOIN table2 b
USING (ID)
WHERE date.A = 1

And the output will be exactly the same as you shared in your question.

Notice that I used CREATE OR REPLACE, which means if you set project_id.dataset.table to the same path as the table within your select, it will replace your current table (in case you have the data coming from one unique table). Otherwise, it will create a new table with the specified new table's name.

Upvotes: 1

Related Questions