Reputation: 51
I need to do a merge statement in BigQuery using a classic flat table, having as target a table with nested and repeated fields, and I'm having trouble understanding how this is supposed to work. Google's examples use direct values, so the syntax here is not really clear to me.
Using this example:
CREATE OR REPLACE TABLE
mydataset.DIM_PERSONA (
IdPersona STRING,
Status STRING,
Properties ARRAY<STRUCT<
Id STRING,
Value STRING,
_loadingDate TIMESTAMP,
_lastModifiedDate TIMESTAMP
>>,
_loadingDate TIMESTAMP NOT NULL,
_lastModifiedDate TIMESTAMP
);
INSERT INTO mydataset.DIM_PERSONA
values
('A', 'KO', [('FamilyMembers', '2', CURRENT_TIMESTAMP(), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL)),
('B', 'KO', [('FamilyMembers', '4', CURRENT_TIMESTAMP(), TIMESTAMP(NULL)),('Pets', '1', CURRENT_TIMESTAMP(), NULL)], CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
;
CREATE OR REPLACE TABLE
mydataset.PERSONA (
IdPersona STRING,
Status STRING,
IdProperty STRING,
Value STRING
);
INSERT INTO mydataset.PERSONA
VALUES('A', 'OK','Pets','3'),('B', 'OK','FamilyMembers','5'),('C', 'OK','Pets','2')
The goal is to:
This INSERT works:
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
SELECT
IdPersona,
Status,
ARRAY(
SELECT AS STRUCT
IdProperty,
Value,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
) Properties,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
FROM mydataset.PERSONA
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (IdPersona, Status, Properties, CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
But I would like to build the nested/repeated fields in the INSERT clause, because for the UPDATE I would also need (I think) to do a "SELECT AS STRUCT * REPLACE" by comparing the values of TRG with SRC. This doesn't work:
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
SELECT
*
FROM mydataset.PERSONA
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (
IdPersona,
Status,
ARRAY(
SELECT AS STRUCT
IdProperty,
Value,
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
),
CURRENT_TIMESTAMP(),
TIMESTAMP(NULL)
)
I get "Correlated Subquery is unsupported in INSERT clause."
Even if I used the first option, I don't get how to reference TRG.properties in the UPDATE..
WHEN MATCHED THEN
UPDATE
SET Properties = ARRAY(
SELECT AS STRUCT p_SRC.*
REPLACE (IF(p_SRC.IdProperty=p_TRG.id AND p_SRC.Value<>p_TRG.Value,p_SRC.Value,p_TRG.Value) AS Value)
FROM SRC.Properties p_SRC, TRG.Properties p_TRG
)
Obv this is wrong though.
One way to solve this, as I see it, is to pre-join everything in the USING clause, therefore doing all the replacement there, but it feels very wrong for a merge statement.
Can anyone help me figure this out, please? :\
Upvotes: 1
Views: 5041
Reputation: 4051
Also, while you did not provide your desired output, I was able to create a query based on the objectives you described and your code and with the sample data you provided.
Following the below goals:
- Update IdPersona='A', adding a new element in Properties and changing Status
- Update IdPersona='B', updating the existent element in Properties
- Insert IdPersona='C'
Instead of doing a replace and rebuild operation, I used:
Below is the query with the proper comments:
#first step update current values and insert new IdPersonas
MERGE sample.DIM_PERSONA_test2 T
USING sample.PERSONA_test2 S
ON T.IdPersona = S.IdPersona
#update A but not insert
WHEN MATCHED AND T.IdPersona ="A" THEN
UPDATE SET STATUS = "OK"
#update B
WHEN MATCHED AND T.IdPersona ="B" THEN
UPDATE SET Properties = [( S.IdPersona, S.IdProperty,TIMESTAMP(NULL), TIMESTAMP(NULL) )]
#insert what is not in the target table
WHEN NOT MATCHED THEN
INSERT(IdPersona, Status , Properties, _loadingDate, _lastModifiedDate ) VALUES (S.IdPersona, S.Status, [( IdProperty,Value, TIMESTAMP(NULL), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL));
#insert new values when IdPersona="A"
#you will see the result won't be aggregated properly
INSERT INTO sample.DIM_PERSONA_test2(IdPersona, Status , Properties, _loadingDate, _lastModifiedDate)
SELECT IdPersona, Status,[( IdProperty,Value, TIMESTAMP(NULL), TIMESTAMP(NULL))], CURRENT_TIMESTAMP(), TIMESTAMP(NULL) from sample.PERSONA_test2
where IdPersona = "A";
#replace the above table to recriate the ARRAY<STRUCT<>>
CREATE OR REPLACE TABLE sample.DIM_PERSONA_FINAL_test2 AS(
SELECT t1.*, t2._loadingDate,t2._lastModifiedDate
FROM( SELECT a.IdPersona,
a.Status,
ARRAY_AGG(STRUCT( Properties.Id as Id, Properties.Value as Value, Properties._loadingDate ,
Properties._lastModifiedDate AS _lastModifiedDate)) AS Properties
FROM sample.DIM_PERSONA_test2 a, UNNEST(Properties) as Properties
GROUP BY 1,2
ORDER BY a.IdPersona)t1 LEFT JOIN sample.DIM_PERSONA_test2 t2 USING(IdPersona)
)
And the output,
Notice that when updating the ARRAY<STRUCT<>>, the values are wrapped within [()]. Lastly, pay attention that there are two IdPersona="A" because _loadingDate is required, so it can not be NULL and due to the CURRENT_TIMESTAMP(), there are two different values for this field. Thus, two different records.
Upvotes: 0
Reputation: 51
So, I wanted to share a possible solution, although I still hope there's another way. As mentioned, I pre-compute what I need with a CTE and a FULL OUTER JOIN, therefore recreating the array of structs I need later on (tables will be relatively small so I can afford it).
MERGE INTO mydataset.DIM_PERSONA TRG
USING (
WITH NEW_PROPERTIES AS (
SELECT
COALESCE(idp,IdPersona) IdPersona,
ARRAY_AGG((
SELECT AS STRUCT
COALESCE(idpro,Id) IdProperty,
COALESCE(vl,Value) Value,
COALESCE(_loadingDate,CURRENT_TIMESTAMP) _loadingDate,
IF(idp=IdPersona,CURRENT_TIMESTAMP,TIMESTAMP(NULL)) _lastModifiedDate
)) Properties
FROM (
SELECT DIP.IdPersona, DIP.Status, DIP_PR.*, PER.IdPersona idp, PER.Status st, PER.IdProperty idpro, PER.Value vl
FROM `clean-yew-281811.mydataset.DIM_PERSONA` DIP
CROSS JOIN UNNEST(DIP.Properties) DIP_PR
FULL OUTER JOIN mydataset.PERSONA PER
ON DIP.IdPersona=PER.IdPersona
AND DIP_PR.Id=PER.IdProperty
)
GROUP BY IdPersona
)
SELECT
IdPersona,
'subquery to do here' Status,
NP.Properties
FROM (SELECT DISTINCT IdPersona FROM mydataset.PERSONA) PE
LEFT JOIN NEW_PROPERTIES NP USING (IdPersona)
) SRC ON TRG.IdPersona=SRC.IdPersona
WHEN NOT MATCHED THEN
INSERT VALUES (IdPersona, Status, Properties, CURRENT_TIMESTAMP(), TIMESTAMP(NULL))
WHEN MATCHED THEN
UPDATE
SET
TRG.Status = SRC.Status,
TRG.Properties = SRC.Properties,
TRG._lastModifiedDate = CURRENT_TIMESTAMP()
This works but I'm pretty much avoiding the syntax to update an array of structs, as what I'm doing is a rebuild and replace operation. Hopefully someone can suggest a better way.
Upvotes: 0