user1367204

Reputation: 4797

How to query a Google BigQuery table and remove duplicates based on a subset of columns?

I have a query that joins two Google BigQuery tables and produces a table with 6 columns (a, b, c, d, e, f). Next, I export that table to a Google Cloud Storage bucket and then download the bucket contents as a bunch of CSVs. Finally, I insert those CSVs into a Postgres database table which has 2 primary key columns, a and b.

The problem is that there is no primary key in the BigQuery table, and some of the rows differ in just one value, so I need to produce a query result that has no duplicates in a and b. I have tried the following...

SELECT
  DISTINCT A.a,
           A.b,
           A.c AS c, 
           A.d AS d,
           B.c AS e,
           B.d AS f
FROM
  `my_proj.my_table_A` A
FULL OUTER JOIN
  `my_proj.my_table_B` B
ON
  A.a = B.a
  AND A.b = B.b

...but it will still allow a and b to have duplicates if, for example, f is different. I only need one of the duplicate rows and it does not matter which one.

How do I only remove duplicates based only on a and b?
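(For reference, one load-time fallback I'm aware of is letting Postgres itself skip the duplicates during the insert step, using ON CONFLICT, available since PostgreSQL 9.5. The table names here are placeholders: `staging` holds the raw CSV rows, `target` is the table with PRIMARY KEY (a, b).)

```sql
-- Copy the CSVs into a staging table first, then insert into the
-- target table, silently skipping any row whose (a, b) pair is
-- already present -- including duplicates within staging itself.
INSERT INTO target (a, b, c, d, e, f)
SELECT a, b, c, d, e, f
FROM staging
ON CONFLICT (a, b) DO NOTHING;
```

But I'd rather fix this in the BigQuery query itself.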

Upvotes: 2

Views: 618

Answers (1)

Elliott Brossard

Reputation: 33705

Use GROUP BY with ANY_VALUE:

SELECT
  A.a,
  A.b,
  ANY_VALUE(A.c) AS c, 
  ANY_VALUE(A.d) AS d,
  ANY_VALUE(B.c) AS e,
  ANY_VALUE(B.d) AS f
FROM
  `my_proj.my_table_A` A
FULL OUTER JOIN
  `my_proj.my_table_B` B
ON
  A.a = B.a
  AND A.b = B.b
GROUP BY
  A.a,
  A.b
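An alternative sketch, if you'd rather pick a whole row per (a, b) pair instead of wrapping every column in ANY_VALUE, is ROW_NUMBER in a subquery (same table and column names as above; the choice of which duplicate survives is arbitrary, matching your "it does not matter which one"):

```sql
SELECT a, b, c, d, e, f
FROM (
  SELECT
    A.a,
    A.b,
    A.c AS c,
    A.d AS d,
    B.c AS e,
    B.d AS f,
    -- number the rows within each (a, b) pair; no ORDER BY,
    -- so which row gets rn = 1 is arbitrary
    ROW_NUMBER() OVER (PARTITION BY A.a, A.b) AS rn
  FROM
    `my_proj.my_table_A` A
  FULL OUTER JOIN
    `my_proj.my_table_B` B
  ON
    A.a = B.a
    AND A.b = B.b
)
WHERE rn = 1
```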

Upvotes: 3
