Kim
Kim

Reputation: 3

Remove duplicates from an array in BigQuery

I work within BigQuery, and one of my tables has an array that has duplication that I want to figure out how to remove. I'm able to identify the problem rows using a CTE, but actually removing them is where I get stuck.

Let me provide an example to give more context:

Order Item.Sku Item.Quantity Item.Price Shipment.Ship_Number
123 ABC 2 5.99 UPS123
ABC 2 5.99 UPS234
XYZ 1 19.99
456 ABC 2 5.99 UPS456
789 XYZ 1 19.99 UPS789

So looking at this table (for code, going to call it Order_Table), I would want to delete only the second "row" within the Item array for order 123 - since the sku, quantity, and price are all exactly the same. Although Orders 456 and 789 have the same information in the Item array, they are not considered duplicates because they have different order numbers. I also included an additional array, Shipment, because I'm dealing with a table with more than one array, so I want to make sure any solution takes that into account. So after duplicate deletion, I'd like to end up with this:

Order Item.Sku Item.Quantity Item.Price Shipment.Ship_Number
123 ABC 2 5.99 UPS123
XYZ 1 19.99 UPS234
456 ABC 2 5.99 UPS456
789 XYZ 1 19.99 UPS789

Any ideas how to get there? If you have any questions feel free to ask and I'd be happy to provide more context. Thanks!

EDIT: This is how I identified the problem rows using a CTE:

select * from
(select item.*
,row_number() over (
  partition by 
    order,
    item.sku,
    item.quantity,
    item.price
  order by item.sku)
    as row_id
    from Order_Table t, t.Item item)
  where row_id > 1

Upvotes: 0

Views: 9104

Answers (1)

joell
joell

Reputation: 426

If you want to deduplicate an array in BigQuery, 1) you first have to flatten the array. 2) Subsequently, deduplication can be done as usual (e.g. using distinct or grouping etc.) 3) Finally, to return an array (as the original schema), we can group on the primary key and use ARRAY_AGG.

1) Flattening of the array is done using the UNNEST operator.

The UNNEST operator takes an ARRAY and returns a table, with one row for each element in the ARRAY.

In combination with a CROSS JOIN, we retrieve a row for each array record. (Note, the comma implies a CROSS JOIN, see docs)

SELECT 
  order_id,
  item.sku AS item_sku,
  item.price AS item_price
FROM data, UNNEST(items) AS item

Results after flattening

2) Using DISTINCT we retrieve unique records.

SELECT DISTINCT
  order_id,
  item.sku AS item_sku,
  item.price AS item_price
FROM data, UNNEST(items) AS item

3) We can group on the primary key (order_id) and use ARRAY_AGG to return the results as array. I.e., returning the original schema.

SELECT 
  order_id,
  ARRAY_AGG(
    STRUCT(
      item_sku AS sku,
      item_price AS price
    ) 
  ) AS item
FROM flattened
GROUP BY order_id

Results after grouping.


Full reproducible example

WITH data AS (
  SELECT 
    123 AS order_id,
    [
      STRUCT(
        'ABC' AS sku,
        5.99 AS price
      ),
      STRUCT(
        'ABC' AS sku,
        5.99 AS price
      ),
      STRUCT(
        'XYZ' AS sku,
        19.99 AS price
      )
    ] AS items,
  
  UNION ALL
  
  SELECT 
    456, 
    [
      STRUCT(
        'ABC' AS sku,
        5.99 AS price
      )
    ]
    
  UNION ALL
  
  SELECT 
    789, 
    [
      STRUCT(
        'XYZ' AS sku,
        19.99 AS price
      )
    ]
),

flattened AS (
  SELECT DISTINCT
    order_id,
    item.sku AS item_sku,
    item.price AS item_price
  FROM data, UNNEST(items) AS item
)

SELECT 
  order_id,
  ARRAY_AGG(
    STRUCT(
      item_sku AS sku,
      item_price AS price
    ) 
  ) AS item
FROM flattened
GROUP BY order_id

Upvotes: 4

Related Questions