Reputation: 3
I work within BigQuery, and one of my tables has an array that has duplication that I want to figure out how to remove. I'm able to identify the problem rows using a CTE, but actually removing them is where I get stuck.
Let me provide an example to give more context:
Order | Item.Sku | Item.Quantity | Item.Price | Shipment.Ship_Number |
---|---|---|---|---|
123 | ABC | 2 | 5.99 | UPS123 |
ABC | 2 | 5.99 | UPS234 | |
XYZ | 1 | 19.99 | ||
456 | ABC | 2 | 5.99 | UPS456 |
789 | XYZ | 1 | 19.99 | UPS789 |
So looking at this table (for code, going to call it Order_Table), I would want to delete only the second "row" within the Item array for order 123 - since the sku, quantity, and price are all exactly the same. Although Orders 456 and 789 have the same information in the Item array, they are not considered duplicates because they have different order numbers. I also included an additional array, Shipment, because I'm dealing with a table with more than one array, so I want to make sure any solution takes that into account. So after duplicate deletion, I'd like to end up with this:
Order | Item.Sku | Item.Quantity | Item.Price | Shipment.Ship_Number |
---|---|---|---|---|
123 | ABC | 2 | 5.99 | UPS123 |
XYZ | 1 | 19.99 | UPS234 | |
456 | ABC | 2 | 5.99 | UPS456 |
789 | XYZ | 1 | 19.99 | UPS789 |
Any ideas how to get there? If you have any questions feel free to ask and I'd be happy to provide more context. Thanks!
EDIT: This is how I identified the problem rows using a CTE:
select * from
(select item.*
,row_number() over (
partition by
order,
item.sku,
item.quantity,
item.price
order by item.sku)
as row_id
from Order_Table t, t.Item item)
where row_id > 1
Upvotes: 0
Views: 9104
Reputation: 426
If you want to deduplicate an array in BigQuery, 1) you first have to flatten the array. 2) Subsequently, deduplication can be done as usual (e.g. using distinct or grouping etc.) 3) Finally, to return an array (as the original schema), we can group on the primary key and use ARRAY_AGG
.
1) Flattening of the array is done using the UNNEST operator.
The UNNEST operator takes an ARRAY and returns a table, with one row for each element in the ARRAY.
In combination with a CROSS JOIN
, we retrieve a row for each array record. (Note, the comma implies a CROSS JOIN
, see docs)
SELECT
order_id,
item.sku AS item_sku,
item.price AS item_price
FROM data, UNNEST(items) AS item
2) Using DISTINCT
we retrieve unique records.
SELECT DISTINCT
order_id,
item.sku AS item_sku,
item.price AS item_price
FROM data, UNNEST(items) AS item
3) We can group on the primary key (order_id
) and use ARRAY_AGG to return the results as array. I.e., returning the original schema.
SELECT
order_id,
ARRAY_AGG(
STRUCT(
item_sku AS sku,
item_price AS price
)
) AS item
FROM flattened
GROUP BY order_id
Full reproducible example
WITH data AS (
SELECT
123 AS order_id,
[
STRUCT(
'ABC' AS sku,
5.99 AS price
),
STRUCT(
'ABC' AS sku,
5.99 AS price
),
STRUCT(
'XYZ' AS sku,
19.99 AS price
)
] AS items,
UNION ALL
SELECT
456,
[
STRUCT(
'ABC' AS sku,
5.99 AS price
)
]
UNION ALL
SELECT
789,
[
STRUCT(
'XYZ' AS sku,
19.99 AS price
)
]
),
flattened AS (
SELECT DISTINCT
order_id,
item.sku AS item_sku,
item.price AS item_price
FROM data, UNNEST(items) AS item
)
SELECT
order_id,
ARRAY_AGG(
STRUCT(
item_sku AS sku,
item_price AS price
)
) AS item
FROM flattened
GROUP BY order_id
Upvotes: 4