Reputation: 774
I have two tables under the dataset my_dataset in BigQuery. Data in these tables is populated via a Dataflow job running in GCP.
In the event_attributes table there are 4 distinct values of attribute_number: [1, 2, 3, 4]. For a given events_fk value, there can be at most 4 rows in event_attributes, i.e. the value of all_count in the query below can be at most 4.
select events_fk, count(*) as all_count from `my_dataset.event_attributes` group by events_fk order by all_count desc limit 1
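As a sanity check, the bound can also be read off directly (a quick sketch against the same table; it returns 4 on the sample data added in the edit below):

-- highest number of attribute rows attached to any single event
select max(cnt) as max_rows_per_event
from (
  select events_fk, count(*) as cnt
  from `my_dataset.event_attributes`
  group by events_fk
)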
Now I want to create a materialized view that has the count of all events and the count of completed events (from the events table), grouped by time, app, completed and attribute_value (from the event_attributes table). For this I'm trying to use the query below:
create materialized view my_dataset.test_mv as (
  select
    extract(DATE FROM timestamp_seconds(epoch_seconds_utc)) as date,
    extract(HOUR FROM timestamp_seconds(epoch_seconds_utc)) as hour,
    app,
    completed,
    attribute_value,
    count(*) as all_count,
    sum(cast(completed as tinyint)) as completed_count
  from `my_dataset.events` c
  left join unnest(
    (
      select array_agg(struct(id, events_fk, attribute_number, attribute_value)) as eattr
      from `my_dataset.event_attributes`
    )
  ) a
  on c.id = a.events_fk
  group by date, hour, app, completed, attribute_value
)
Since left join is not supported in materialized views as of now, I was trying to use unnest() with an array argument inside it as a workaround. But BigQuery complains about this query with an error, and I'm guessing it's because of the subquery that I'm using under unnest().
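For reference, a plain inner join of this shape seems to be accepted (materialized views support inner joins, if I read the docs correctly; test_mv_inner is just a placeholder name), but it drops the events that have no attribute rows, which defeats the purpose:

create materialized view my_dataset.test_mv_inner as (
  select
    extract(DATE FROM timestamp_seconds(c.epoch_seconds_utc)) as date,
    extract(HOUR FROM timestamp_seconds(c.epoch_seconds_utc)) as hour,
    c.app,
    c.completed,
    a.attribute_value,
    count(*) as all_count,
    sum(cast(c.completed as tinyint)) as completed_count
  from `my_dataset.events` c
  -- inner join: events without attribute rows are silently excluded
  join `my_dataset.event_attributes` a
  on c.id = a.events_fk
  group by date, hour, app, completed, attribute_value
)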
Reason for trying to create a materialized view: I want to use it for constructing different reports.
I've also thought of combining the data of these two tables in the Dataflow stage itself, but that would be a long route for now, since I would have to plan the changes to the Dataflow job.
Question: I need help with the create materialized view query. If such a query is not possible, what would be the best course of action in this situation? Any suggestions would be very helpful.
Thanks in advance!!
Edit: Sample data can be constructed via the queries below.
create table my_dataset.events (
id INTEGER,
epoch_seconds_utc INTEGER,
app STRING,
engaged BOOLEAN,
completed BOOLEAN
);
create table my_dataset.event_attributes (
id INTEGER,
events_fk INTEGER,
attribute_number STRING,
attribute_value STRING
);
-- POPULATE EVENTS DATA
insert into `my_dataset.events` values (435515, 1660807131, 'airtel', FALSE, FALSE);
insert into `my_dataset.events` values (435517, 1660807514, 'vodafone', FALSE, FALSE);
insert into `my_dataset.events` values (285111, 1607927586, 'telstra', FALSE, FALSE);
insert into `my_dataset.events` values (435051, 1649137598, 'healthways', FALSE, FALSE);
insert into `my_dataset.events` values (435017, 1649068151, 'tangoe', FALSE, FALSE);
insert into `my_dataset.events` values (435519, 1660821334, 'airtel', FALSE, FALSE);
insert into `my_dataset.events` values (703, 1589840471, 'vodafone', FALSE, FALSE);
insert into `my_dataset.events` values (1619, 1591014513, 'telstra', FALSE, FALSE);
insert into `my_dataset.events` values (783, 1589840537, 'tangoe', FALSE, FALSE);
insert into `my_dataset.events` values (20245, 1596825490, 'healthways', FALSE, FALSE);
-- POPULATE EVENT ATTRIBUTES DATA
insert into `my_dataset.event_attributes` values (6341, 435017, '1', 'In Progress');
insert into `my_dataset.event_attributes` values (6381, 435051, '1', 'In Progress');
insert into `my_dataset.event_attributes` values (5429, 285111, '1', '9191919191');
insert into `my_dataset.event_attributes` values (5431, 285111, '2', 'Direct Connection');
insert into `my_dataset.event_attributes` values (5433, 285111, '3', 'English');
insert into `my_dataset.event_attributes` values (5435, 285111, '4', 'Covid');
insert into `my_dataset.event_attributes` values (7161, 435517, '3', 'English');
insert into `my_dataset.event_attributes` values (7159, 435517, '2', 'Some value');
insert into `my_dataset.event_attributes` values (7151, 435515, '2', 'Some value');
insert into `my_dataset.event_attributes` values (7163, 435517, '4', 'Default');
insert into `my_dataset.event_attributes` values (7157, 435517, '1', '50000');
insert into `my_dataset.event_attributes` values (7149, 435515, '1', '50000');
insert into `my_dataset.event_attributes` values (7153, 435515, '3', 'English');
insert into `my_dataset.event_attributes` values (7155, 435515, '4', 'Default');
With this data, if you perform an inner join using a query like
select e.id, ea.attribute_number, ea.attribute_value
from my_dataset.events e join my_dataset.event_attributes ea on e.id = ea.events_fk
I get 14 rows, and if I use a left join in the above query I get 19 rows, because there are records in the events table that don't have corresponding entries in event_attributes.
Desired output: I want all rows of the events table to be considered when creating the materialized view.
Let me know if any information is needed from my end.
Upvotes: 1
Views: 3167
Reputation: 1201
As consulted with a Googler:
A materialized view is a good choice when the sources are read much more often than they are written, when having pre-processed data will save time or cost, or when an incremental refresh would be beneficial. [1]
Since LEFT JOIN is not supported, you may consider one of the other resource types instead; for a comparison, refer to [2].
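For example, a scheduled query that periodically rebuilds a summary table can run the LEFT JOIN that a materialized view cannot. A rough sketch (event_summary and the cadence are placeholders; countif(completed) is equivalent to your sum(cast(completed as tinyint))):

-- Set this up as a BigQuery scheduled query (e.g. hourly).
create or replace table my_dataset.event_summary as
select
  extract(DATE FROM timestamp_seconds(c.epoch_seconds_utc)) as date,
  extract(HOUR FROM timestamp_seconds(c.epoch_seconds_utc)) as hour,
  c.app,
  c.completed,
  a.attribute_value,
  count(*) as all_count,
  countif(c.completed) as completed_count
from `my_dataset.events` c
left join `my_dataset.event_attributes` a
  on c.id = a.events_fk
group by date, hour, app, completed, attribute_value;

The trade-off versus a materialized view is freshness: the table is only as current as its last scheduled run.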
[1] https://cloud.google.com/bigquery/docs/materialized-views-intro#use_cases
[2] https://cloud.google.com/bigquery/docs/materialized-views-intro#comparison
Upvotes: 1