Reputation: 23
I'm currently scraping data and dumping it into a Cloud SQL Postgres database. The database grows by ~3GB/day and I'm looking to keep data for at least 3 months, so I need an efficient way to execute queries. I've therefore connected my Cloud SQL instance to BigQuery. The following is an example of a query that I'm running on BigQuery, but I'm skeptical: I'm not sure whether the query is being executed in Postgres or in BigQuery.
SELECT * FROM EXTERNAL_QUERY(
  "project.us-cloudsql-instance",
  "SELECT date_trunc('day', created_at) d, variable1, AVG(variable2) FROM my_table GROUP BY 1,2 ORDER BY d;"
);
It seems like the query is being executed in PostgreSQL, though, not in BigQuery. Is this true? If it is, is there a way for me to load data from PostgreSQL into BigQuery in real time and execute queries directly in BigQuery?
Upvotes: 2
Views: 1202
Reputation: 1524
I think you are using federated queries. These queries are intended to collect data from a Cloud SQL instance from within BigQuery:
BigQuery Cloud SQL federation enables BigQuery to query data residing in Cloud SQL in real-time, without copying or moving data. It supports both MySQL (2nd generation) and PostgreSQL instances in Cloud SQL.
The query is being executed in Cloud SQL, and this could lead to lower performance than if you ran it in BigQuery.
EXTERNAL_QUERY executes the query in Cloud SQL and returns results as a temporary table. The result would be a BigQuery table.
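For example, you can treat the result of EXTERNAL_QUERY like any other table and join it against native BigQuery data. A minimal sketch, assuming a hypothetical my_dataset.labels table in BigQuery:

-- Sketch: join the temporary table returned by EXTERNAL_QUERY
-- against a native BigQuery table (my_dataset.labels is hypothetical).
SELECT t.variable1, l.label, AVG(t.variable2) AS avg_v2
FROM EXTERNAL_QUERY(
  "project.us-cloudsql-instance",
  "SELECT variable1, variable2 FROM my_table"
) AS t
JOIN my_dataset.labels AS l
  ON l.variable1 = t.variable1
GROUP BY t.variable1, l.label;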
Now, the current ways to load data into BigQuery are: from Cloud Storage, from other Google services (such as Google Ad Manager and Google Ads), from a readable data source, by inserting individual records using streaming inserts, with DML statements, and with a BigQuery I/O transform in a Dataflow pipeline.
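Since DML statements work together with federated queries, one simple option is to periodically materialize the Cloud SQL data into a native BigQuery table and run your analytics against that copy. A minimal sketch, assuming a hypothetical destination table my_dataset.my_table_copy with the same schema:

-- Sketch: append recent rows from Cloud SQL into a native BigQuery table.
-- The inner WHERE clause limits the transfer to the last day of data.
INSERT INTO my_dataset.my_table_copy (created_at, variable1, variable2)
SELECT created_at, variable1, variable2
FROM EXTERNAL_QUERY(
  "project.us-cloudsql-instance",
  "SELECT created_at, variable1, variable2 FROM my_table WHERE created_at >= now() - interval '1 day'"
);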
This solution is well worth a look, as it is pretty similar to what you need:
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow, they manage to keep BigQuery synchronized with their MySQL database every 15 minutes.
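Adapted to your Postgres table, the incremental extraction query could look something like this. A minimal sketch: the watermark timestamp would be substituted in by your scheduler, and the 1-hour overlap is an arbitrary safety margin for late rows:

-- Sketch: high-watermark SELECT against Postgres. '2020-01-01 00:00:00'
-- stands in for the last watermark tracked by the orchestrator.
SELECT created_at, variable1, variable2
FROM my_table
WHERE created_at >= TIMESTAMP '2020-01-01 00:00:00' - INTERVAL '1 hour'
ORDER BY created_at;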
Upvotes: 1
Reputation: 5503
Although technically it is possible to rewrite the query as
SELECT date_trunc('day', created_at) d, variable1, AVG(variable2)
FROM EXTERNAL_QUERY("project.us-cloudsql-instance",
"SELECT created_at, variable1, variable2 FROM my_table")
GROUP BY 1,2 ORDER BY d;
it is not recommended. It is better to do as much of the aggregation and filtering as possible in Cloud SQL, to reduce the amount of data that has to be transferred from Cloud SQL to BigQuery.
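For example, if you only need the last 7 days, pushing both the aggregation and a date filter into the inner query keeps the transfer small. A sketch (the 7-day window is an arbitrary example):

-- Sketch: aggregation and filtering both pushed down to Postgres,
-- so only the aggregated rows cross over to BigQuery.
SELECT * FROM EXTERNAL_QUERY(
  "project.us-cloudsql-instance",
  "SELECT date_trunc('day', created_at) d, variable1, AVG(variable2) FROM my_table WHERE created_at >= now() - interval '7 days' GROUP BY 1,2 ORDER BY d;"
);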
Upvotes: 1