Sam McVeety

Reputation: 3214

How can I de-duplicate records based on a specific column in BigQuery?

I have a growing table of records, and I'd like to append modified records to it. I'd then like a logical view of only the newest version of each record (the row with the highest modified_date for each primary_key). I tried a JOIN against the table with a GROUP BY primary_key, but that requires an ORDER BY modified_date over the entire table, which exceeds resources.

Upvotes: 0

Views: 162

Answers (2)

Elliott Brossard

Reputation: 33745

There is now a better way of doing this. Here is an example:

WITH T AS (
  SELECT x, y, MOD(y, 2) AS z
  FROM UNNEST([5, 4, 3, 2]) AS x WITH OFFSET y
)
SELECT
  z,
  ARRAY_AGG(x ORDER BY y LIMIT 1)[OFFSET(0)] AS top_x
FROM T
GROUP BY z;

For each value of z, this returns the x with the lowest y; that is, the top x value as ranked by a second column, grouped by a third column. The query in the other answer could be expressed as:

WITH my_table AS (
  SELECT 1 AS primary_key, "foo" AS value, DATE('2016-11-09') AS modified_date UNION ALL
  SELECT 1, "bar", DATE('2016-11-10') UNION ALL
  SELECT 2, "baz", DATE('2016-01-01')
)
SELECT
  row.*
FROM (
  SELECT
    ARRAY_AGG(t ORDER BY modified_date DESC LIMIT 1)[OFFSET(0)] AS row
  FROM my_table AS t
  GROUP BY primary_key
);

This returns the full row with the most recent modified_date for each primary_key. In theory, you should be able to use .* directly after [OFFSET(0)] (and not need a subselect), but there appears to be a bug with column resolution that I'm looking into.

Upvotes: 0

Sam McVeety

Reputation: 3214

You can avoid the resource explosion by specifying PARTITION BY, so that rows are sorted within each primary_key group rather than across the entire table. This pattern suffices:

SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY modified_date DESC) AS seq
  FROM
    my_table)
WHERE
  seq = 1;
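
To try the pattern without a real table, here is a minimal self-contained sketch in standard SQL. The sample rows are borrowed from the other answer and are only illustrative; `* EXCEPT (seq)` is an optional addition (not part of the pattern above) that drops the helper column from the output.

```sql
-- Illustrative sample data; my_table stands in for the real, growing table.
WITH my_table AS (
  SELECT 1 AS primary_key, "foo" AS value, DATE('2016-11-09') AS modified_date UNION ALL
  SELECT 1, "bar", DATE('2016-11-10') UNION ALL
  SELECT 2, "baz", DATE('2016-01-01')
)
SELECT
  * EXCEPT (seq)  -- hide the helper row number from the result
FROM (
  SELECT
    *,
    -- Number the versions of each record, newest first.
    ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY modified_date DESC) AS seq
  FROM
    my_table)
WHERE
  seq = 1;  -- keep only the newest version of each primary_key
```

With this sample data, the result should be the "bar" row for primary_key 1 and the "baz" row for primary_key 2. If two versions of a record share the same modified_date, ROW_NUMBER picks one of them arbitrarily; adding a tiebreaker column to the ORDER BY would make the choice deterministic.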

Upvotes: 1
