Eli Baskin
Eli Baskin

Reputation: 106

BigQuery: How to insert a new value in repeated record?

I want to save a history of user statuses.

For this I have a table with two colums: user_identifier and status.

user_identifier is a string, status is a repeated record with key:value pairs: date and status.

When a user changes a status (for example from active to inactive), I would like to update this table and add the new status, while keeping the old one.

This is the table schema:

[
{
"description": "user identifier",
"mode": "REQUIRED",
"name": "user_id",
"type": "STRING"
},
{
"description": "status - can be either sent or pending, initial state is pending",
"mode": "REPEATED",
"name": "status",
"type": "RECORD",
"fields": [
  {
  "name": "status_date",
  "type": "DATE",
  "mode": "REQUIRED"
  },
  {
  "name": "value",
  "type": "STRING",
  "mode": "REQUIRED"
  }
]
}
]

Is this even possible to insert a new user status in this schema? Should I redesign the schema? How to do it properly in BigQuery, to utilize its nesting abilities?

Upvotes: 3

Views: 3331

Answers (1)

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173028

Below is for BigQuery Standard SQL and assumes you have a statuses table project.dataset.statuses as you described in your question and you have updates table project.dataset.updates where you accumulate updates for periodical updates of statuses table

So dummy data could look like

WITH `project.dataset.statuses` AS (
  SELECT 'a' user_id, [STRUCT<status_date DATE, value STRING>('2018-11-03', 'pending')] status UNION ALL
  SELECT 'b', [STRUCT<status_date DATE, value STRING>('2018-11-04', 'pending')] UNION ALL
  SELECT 'c', [] 
), `project.dataset.updates` AS (
  SELECT 'a' user_id, [STRUCT<status_date DATE, value STRING>('2018-11-05', 'sent')] new_statuses UNION ALL
  SELECT 'c', [STRUCT<status_date DATE, value STRING>('2018-11-05', 'pending')]
)

where updates table has exact same schema and represents new updates that needs to be added to main table

Below SELECT returns joined statuses

#standardSQL
SELECT 
  t.user_id, 
  IF(u.user_id IS NULL, status, ARRAY_CONCAT(status, new_statuses)) status
FROM `project.dataset.statuses` t
LEFT JOIN `project.dataset.updates` u
ON t.user_id = u.user_id   

You can use below DDL to "update" statuses table with them

#standardSQL
CREATE OR REPLACE TABLE `project.dataset.statuses` AS
SELECT 
  t.user_id, 
  IF(u.user_id IS NULL, status, ARRAY_CONCAT(status, new_statuses)) status
FROM `project.dataset.statuses` t
LEFT JOIN `project.dataset.updates` u
ON t.user_id = u.user_id   

If to apply to dummy data

statuses:

Row user_id status.status_date  status.value     
1   a       2018-11-03          pending  
2   b       2018-11-04          pending  
3   c             

updates:

Row user_id new_statuses.status_date    new_statuses.value   
1   a       2018-11-05          sent     
3   c       2018-11-05          pending  

result will be as

Row user_id status.status_date  status.value     
1   a       2018-11-03          pending  
            2018-11-05          sent     
2   b       2018-11-04          pending  
3   c       2018-11-05          pending    

In case if updates table can consists of new users that are not yet in main table - below will handle this scenario

#standardSQL
-- CREATE OR REPLACE TABLE `project.dataset.statuses` AS
SELECT 
  IFNULL(t.user_id, u.user_id) user_id,
  CASE 
    WHEN t.user_id = u.user_id THEN ARRAY_CONCAT(status, new_statuses)
    WHEN t.user_id IS NULL THEN new_statuses
    WHEN u.user_id IS NULL THEN status
  END status
FROM `project.dataset.statuses` t
FULL JOIN `project.dataset.updates` u
ON t.user_id = u.user_id   

Upvotes: 2

Related Questions