Reputation: 665
I've created a pipeline to export data from BigQuery into Snowflake using Google Cloud services. I'm seeing some issues because the original data source is Firebase. Firebase exports analytics data into a dataset called analytics_{GA_project_id}, and the events are sent to a partitioned table called events_{YYYYMMDD}. The pipeline runs smoothly, but after a few days I've noticed when running the gap analysis that there is more data in BigQuery (approx. 3% at times) than in Snowflake. I've raised a ticket with Firebase and Google Analytics for Firebase, and they confirmed there are events that are delayed by up to 3 days. While looking at what other solutions I could explore, I found that Snowflake can also connect to an external table hosted in Google Cloud Storage (GCS).
Is there a way to create a replicated table that automatically syncs with BigQuery and pushes the data to an external table Snowflake can connect to?
The process I've set up relies on a scheduled query to unload any new daily table to GCS, which is cost-efficient... I could change the query to also check for new data in already-exported tables, but every run would then have to scan the entire two months' worth of data (or assume only a 3-day delay, which is not good practice), so this would add a lot more consumption.
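For reference, the daily unload is roughly of this shape (the project, dataset and bucket names below are placeholders, not the real ones):

```sql
-- Rough shape of the scheduled unload: export yesterday's Firebase daily table
-- to GCS as Parquet. (The real scheduled query also checks whether new daily
-- tables have appeared before exporting.)
DECLARE shard STRING DEFAULT FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));

EXECUTE IMMEDIATE FORMAT("""
  EXPORT DATA OPTIONS (
    uri = 'gs://my-bucket/firebase/events_%s/*.parquet',
    format = 'PARQUET'
  ) AS
  SELECT * FROM `my-project.analytics_123456789.events_%s`
""", shard, shard);
```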
I hope there is a more elegant solution.
Upvotes: 1
Views: 1438
Reputation: 665
I've managed to find a solution: a scheduled script that adds an additional column called gcs_export_timestamp. On every run it checks whether there is a new daily table and exports it, then loops through the already exported tables where gcs_export_timestamp is null (meaning new rows) and exports those too. I've described the process here.
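A stripped-down sketch of the idea (this is not the full script; project, dataset and bucket names are placeholders):

```sql
-- Sketch only: stamp exported rows with gcs_export_timestamp and, on every run,
-- export whatever is still unstamped (new daily tables and late-arriving rows alike).
-- A real script would also skip tables that have nothing new to export.
DECLARE run_id STRING DEFAULT FORMAT_TIMESTAMP('%Y%m%d%H%M%S', CURRENT_TIMESTAMP());

FOR tbl IN (
  SELECT table_name
  FROM `my-project.analytics_123456789.INFORMATION_SCHEMA.TABLES`
  WHERE table_name LIKE 'events_2%'   -- daily tables only, skip events_intraday_*
) DO
  -- new daily tables get the tracking column; existing ones are left untouched
  EXECUTE IMMEDIATE FORMAT(
    "ALTER TABLE `my-project.analytics_123456789.%s` ADD COLUMN IF NOT EXISTS gcs_export_timestamp TIMESTAMP",
    tbl.table_name);

  -- unload only the rows that have never been exported
  EXECUTE IMMEDIATE FORMAT("""
    EXPORT DATA OPTIONS (
      uri = 'gs://my-bucket/firebase/%s/%s/*.parquet',
      format = 'PARQUET'
    ) AS
    SELECT * EXCEPT (gcs_export_timestamp)
    FROM `my-project.analytics_123456789.%s`
    WHERE gcs_export_timestamp IS NULL
  """, tbl.table_name, run_id, tbl.table_name);

  -- mark those rows as exported
  EXECUTE IMMEDIATE FORMAT("""
    UPDATE `my-project.analytics_123456789.%s`
    SET gcs_export_timestamp = CURRENT_TIMESTAMP()
    WHERE gcs_export_timestamp IS NULL
  """, tbl.table_name);
END FOR;
```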
Upvotes: 0
Reputation: 59335
I think you found a limitation of BigQuery's INFORMATION_SCHEMA.TABLES: it won't tell you when a table was last updated.
For example,

SELECT *
FROM `bigquery-public-data`.stackoverflow.INFORMATION_SCHEMA.TABLES

shows that all these tables were created in 2016, but there's no way to see that they were updated 3 months ago.
I'm looking at the "Create the BigQuery Scheduled Script" part of the notebook you shared at https://github.com/budaesandrei/analytics-pipelines-snowflake/blob/main/firebase-bigquery-gcs-snowflake/Analytics%20Pipeline%20DevOps%20-%20Firebase%20to%20Snowflake.ipynb.
My recommendation is, instead of relying on a BigQuery scheduled script, to start doing this outside BigQuery so you have access to the whole API. Exports will be cheaper this way too.
For example, the command line bq show bigquery-public-data:stackoverflow.posts_answers will show you the Last modified date for that table.
Let's skip the command line, and let's look at the API directly.
tables.list can help you find all tables created after a certain date: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/list

Then iterating with tables.get will get you the last modified time for each: https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#Table

Then a jobs.insert can extract that table to GCS for Snowpipe to automatically import: https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#JobConfigurationExtract

Make sure to give each export of the same table a new name, so Snowpipe can identify the files as new.

Then you can use METADATA$FILENAME to deduplicate the imports of updated versions of the same table in Snowflake (see the sketch below): https://docs.snowflake.com/en/user-guide/querying-metadata.html
All of these are available in the Python BigQuery SDK. You'll just need to rely on a different scheduler and runtime, but the best news is you'll save a lot on BigQuery exports, as extract jobs are mostly free when run this way instead of from inside BigQuery.
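On the Snowflake side, the METADATA$FILENAME deduplication could look something like this sketch. It assumes each modified table is re-exported in full to a new run-stamped folder (firebase/<table>/<run_id>/...), that Snowpipe records the file name at load time, and that the stage and integration names are placeholders; adjust the SPLIT_PART positions to however the paths actually come through:

```sql
-- Sketch only; table, stage, integration names and the path layout are assumptions.
CREATE TABLE IF NOT EXISTS firebase_events_raw (
  event       VARIANT,   -- whole Parquet record
  source_file STRING     -- METADATA$FILENAME captured at load time
);

CREATE OR REPLACE PIPE firebase_events_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = 'GCS_NOTIFICATION_INT'   -- placeholder notification integration
AS
  COPY INTO firebase_events_raw (event, source_file)
  FROM (SELECT $1, METADATA$FILENAME FROM @firebase_gcs_stage)
  FILE_FORMAT = (TYPE = PARQUET);

-- Keep only the rows from the newest export run of each source table,
-- assuming file names like firebase/<table>/<run_id>/part-000.parquet.
CREATE OR REPLACE VIEW firebase_events_dedup AS
SELECT *
FROM firebase_events_raw
QUALIFY SPLIT_PART(source_file, '/', 3) =
        MAX(SPLIT_PART(source_file, '/', 3))
          OVER (PARTITION BY SPLIT_PART(source_file, '/', 2));
```

Because newer run folders sort after older ones, the MAX() picks up the latest export of each table without needing a separate bookkeeping table.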
Upvotes: 1
Reputation: 694
I think the issue is just the nature of the GA load-to-BigQuery process. Google say that same-day data is loaded at least 3 times a day (no further detail is given) and, if there is an error, it is rolled back.
Looking at our own data, there are usually 2 intraday tables on any given day, so the last fully loaded date is 2 days ago. If the final intraday load (which changes it from an intraday to a final table) has an error, then there is a possibility that the last full day is from 3 days ago.
I'm not familiar with Firebase myself, but one way to avoid this in BigQuery directly is to only query the final tables (those in the format ga_sessions_YYYYMMDD).
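For example (project and dataset names are placeholders), a wildcard query can be restricted to the final daily tables by only accepting purely numeric suffixes, which excludes the ga_sessions_intraday_YYYYMMDD shards:

```sql
-- Only the finalised daily tables: the intraday shards have a non-numeric
-- suffix (intraday_YYYYMMDD) and are filtered out.
SELECT date, COUNT(*) AS sessions
FROM `my-project.my_ga_dataset.ga_sessions_*`
WHERE REGEXP_CONTAINS(_TABLE_SUFFIX, r'^\d{8}$')
GROUP BY date
ORDER BY date;
```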
Upvotes: 0