Andrei Budaes

Reputation: 665

BigQuery to Snowflake using external tables

I've created a pipeline to export data from BigQuery into Snowflake using Google Cloud services. I'm seeing some issues because the original data source is Firebase: Firebase exports analytics data into a dataset called analytics_{GA_project_id}, and events are written to date-sharded daily tables named events_{YYYYMMDD}. The pipeline runs smoothly, but after a few days I noticed while running the gap analysis that there is more data in BigQuery (approx. 3% at times) than in Snowflake. I raised a ticket with Google Analytics for Firebase support and they confirmed that some events arrive up to 3 days late.

I was looking at what other solutions I could explore, and Snowflake can also connect to an external table backed by files in Google Cloud Storage (GCS).

Is there a way to create a replicated table that automatically syncs with BigQuery and pushes the data to an external table that Snowflake can connect to?

The process I've set up relies on a scheduled query to unload any new daily table to GCS, which is cost-efficient. I could change the query to also check for late-arriving data in already-exported days, but every run would then have to scan the entire two months' worth of data (or assume a maximum delay of 3 days, which is not good practice), so this would add a lot more consumption.

I hope there is a more elegant solution.

Upvotes: 1

Views: 1438

Answers (3)

Andrei Budaes

Reputation: 665

I've managed to find a solution by adding an extra column, gcs_export_timestamp, populated by a scheduled script. On every run the script checks whether there is a new daily table and exports it, then loops through the already-exported tables and exports any rows where gcs_export_timestamp is null (meaning new rows). I've described the process here.
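Roughly, the logic looks like the sketch below. This is only an illustration using the Python BigQuery client (the actual implementation is the scheduled SQL script described in the linked notebook), and the project, staging dataset and bucket names are placeholders:

from datetime import datetime, timezone
from google.cloud import bigquery

PROJECT = "my-project"            # placeholder
STAGING = "firebase_staging"      # placeholder: copies of the events_ tables with gcs_export_timestamp added
BUCKET = "gs://my-export-bucket"  # placeholder

client = bigquery.Client(project=PROJECT)
run_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

for table in client.list_tables(f"{PROJECT}.{STAGING}"):
    if not table.table_id.startswith("events_"):
        continue

    # Export only rows that haven't been stamped yet (new or late-arriving rows).
    client.query(f"""
        EXPORT DATA OPTIONS (
          uri = '{BUCKET}/{table.table_id}/{run_id}/*.parquet',
          format = 'PARQUET'
        ) AS
        SELECT * EXCEPT (gcs_export_timestamp)
        FROM `{PROJECT}.{STAGING}.{table.table_id}`
        WHERE gcs_export_timestamp IS NULL
    """).result()

    # Stamp the exported rows so the next run skips them.
    client.query(f"""
        UPDATE `{PROJECT}.{STAGING}.{table.table_id}`
        SET gcs_export_timestamp = CURRENT_TIMESTAMP()
        WHERE gcs_export_timestamp IS NULL
    """).result()

In practice the export and the stamp should share the same cutoff, so rows that land between the two statements don't get stamped without being exported.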

Upvotes: 0

Felipe Hoffa

Reputation: 59335

I think you found a limitation of BigQuery's INFORMATION_SCHEMA.TABLES: It won't tell you when a table was updated last.

For example

SELECT *
FROM `bigquery-public-data`.stackoverflow.INFORMATION_SCHEMA.TABLES

shows that all these tables were created in 2016, but there's no way to see that they were updated 3 months ago.

I'm looking at the "Create the BigQuery Scheduled Script" step you shared at https://github.com/budaesandrei/analytics-pipelines-snowflake/blob/main/firebase-bigquery-gcs-snowflake/Analytics%20Pipeline%20DevOps%20-%20Firebase%20to%20Snowflake.ipynb.

My recommendation is to stop relying on a BigQuery scheduled script and instead drive this from outside BigQuery, where you have access to the whole API. Exports will be cheaper this way too.

For example, on the command line, bq show bigquery-public-data:stackoverflow.posts_answers will show you the "Last modified" date for that table.

Let's skip the command line and look at the API directly: all of this is available in the Python BigQuery SDK. You'll just need to rely on a different scheduler and runtime, but the best news is that you'll save a lot on BigQuery exports, as exports are mostly free when done this way instead of from inside BigQuery.
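For example, a minimal sketch with the Python client; the table and bucket used for the extract job are placeholders standing in for your own Firebase export:

from google.cloud import bigquery

client = bigquery.Client()

# Same information as `bq show`: when was the table created and last modified?
table = client.get_table("bigquery-public-data.stackoverflow.posts_answers")
print(table.created, table.modified)

# For your own tables, an extract job unloads straight to GCS without running a query.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.PARQUET
)
job = client.extract_table(
    "my-project.analytics_123456789.events_20210101",   # placeholder table
    "gs://my-export-bucket/events_20210101/*.parquet",   # placeholder bucket
    job_config=job_config,
)
job.result()  # wait for the export to finish

Comparing table.modified with the timestamp of your previous export tells you whether a daily table has received late data and needs to be re-extracted.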

Upvotes: 1

Dean Flinter

Reputation: 694

I think the issue is just the nature of the GA load-to-BigQuery process. Google says that same-day data is loaded at least 3 times a day (no further detail given), and if there is an error the load is rolled back.

Looking at our own data, there are usually 2 intraday tables on any given day, so the last fully loaded date is 2 days ago. If the final intraday load (the one that converts the table from intraday to final) has an error, then there is a possibility that the last full day is from 3 days ago.

I'm not familiar with Firebase myself, but one way to avoid this in BigQuery directly is to only query the final tables (those in the format ga_sessions_YYYYMMDD).
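For the Firebase-style export in the question, where the final tables are events_YYYYMMDD and the in-progress ones are events_intraday_YYYYMMDD, a small sketch of that check with the Python client (the dataset name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()

# Keep only the finalised daily tables; events_intraday_YYYYMMDD is still being loaded.
final_tables = sorted(
    t.table_id
    for t in client.list_tables("my-project.analytics_123456789")   # placeholder dataset
    if t.table_id.startswith("events_") and not t.table_id.startswith("events_intraday_")
)
print(final_tables[-3:])   # the most recent fully loaded days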

Upvotes: 0
