Reputation: 115
Is there a way to run real-time analytics with BigQuery? I used the CSV upload option, which kicks off a load job; the data can only be analyzed offline, once the job completes. But the BigQuery announcement mentions using it for real-time analytics. How can this be achieved? Can we append (no updates) data from a Google Cloud database to BigQuery in trickle mode for real-time analytics?
As a side note, I noticed that BigQuery CSV data loads are about an order of magnitude slower than LucidDB and InfiniDB running on my local PC with a 10 GB data file: the BigQuery job took 34 minutes to complete vs. 5 minutes on InfiniDB and LucidDB. Query execution times for simple aggregates are twice as slow on BigQuery as on InfiniDB (6 seconds vs. 3 seconds against the 10 GB file, roughly 30+ million records), but better than LucidDB.
Upvotes: 3
Views: 6858
Reputation: 94
If you're looking for a SQL-based realtime analytics layer to run in parallel with a relational data warehouse, we recently released a realtime analytics API product called Stride that uses continuous SQL queries on streaming data to power exactly the kind of realtime layer you're describing. Stride is built on our open-source streaming-SQL database, PipelineDB, a fork of PostgreSQL that is slated to become a standard PostgreSQL extension by the end of this year.
The nice thing about continuous SQL queries on streams is that if you have a realtime need, then by definition you already know the queries you want to run. Registering those queries up front both speeds up and drastically simplifies the realtime data architecture, while cutting the cost of storing extraneous granular data.
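For concreteness, here's a minimal sketch of what a continuous query looks like in PipelineDB, driven from Python with the standard psycopg2 driver. It assumes a PipelineDB instance running locally; the stream and view names are purely illustrative:

```python
# Minimal PipelineDB continuous-query sketch via psycopg2.
# Assumes PipelineDB is listening on localhost; all names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=pipeline host=localhost")
conn.autocommit = True
cur = conn.cursor()

# Define a stream of raw events and a continuous view that aggregates
# them as they arrive, so only the rollup is stored, not the raw rows.
cur.execute("CREATE STREAM page_views (url text, latency_ms int)")
cur.execute("""
    CREATE CONTINUOUS VIEW latency_by_url AS
    SELECT url, count(*) AS hits, avg(latency_ms) AS avg_latency
    FROM page_views
    GROUP BY url
""")

# Writing to the stream is an ordinary INSERT; the view updates as rows land.
cur.execute("INSERT INTO page_views (url, latency_ms) VALUES (%s, %s)",
            ("/home", 42))

# Reading the realtime rollup is an ordinary SELECT.
cur.execute("SELECT * FROM latency_by_url")
print(cur.fetchall())
```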
Upvotes: 0
Reputation: 63
It is better to keep real-time analytics and data warehousing separate. The former can be optimized for data-collection speed, to power alerts and triggers; the latter for massively parallel searches and aggregation.
There is not really a single solution for both, as the goals are mutually exclusive. To be fast at large-scale aggregation, you have to do a lot of work indexing and storing data - the very work you want to minimize in order to get at incoming data quickly.
Upvotes: 2
Reputation: 490
The best way to import large amounts of data into BigQuery is with the Python tools provided by Google; they are the most efficient way to upload data, and they are what I use.
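As one illustration (not necessarily this poster's exact workflow), here is a minimal sketch of a CSV load using Google's google-cloud-bigquery Python client; the bucket, dataset, and table names are placeholders:

```python
# Minimal CSV load sketch with the google-cloud-bigquery client.
# Bucket, dataset, and table names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the CSV header row
    autodetect=True,       # infer the schema from the file
)

# Loading from Cloud Storage is generally faster than a local upload.
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
print(f"Loaded {load_job.output_rows} rows.")
```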
Upvotes: 0
Reputation: 59175
2013 update: You can now stream data into BigQuery and query it in realtime.
(data you stream in becomes available for querying within seconds)
https://developers.google.com/bigquery/streaming-data-into-bigquery
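For example, here is a minimal sketch of a streaming insert with the google-cloud-bigquery Python client (a later client library than existed when streaming launched; the table name and row fields are placeholders):

```python
# Minimal streaming-insert sketch; table name and row fields are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.events"

rows = [
    {"user_id": "u1", "event": "click", "ts": "2013-09-17T12:00:00Z"},
    {"user_id": "u2", "event": "view",  "ts": "2013-09-17T12:00:01Z"},
]

# insert_rows_json uses the tabledata.insertAll streaming API;
# streamed rows are queryable within seconds.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Encountered errors:", errors)
else:
    print("Rows streamed successfully.")
```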
Upvotes: 8
Reputation: 7877
There is currently no way to stream data into BigQuery one record at a time. The only supported ingestion methods are importing CSV files, either from Google Cloud Storage or directly via a multipart MIME POST request. Ingestion quotas are 2 imports per minute, 1,000 import requests per day, and 100 GB of data per import request (see: https://developers.google.com/bigquery/docs/quota-policy ).
There's a distinction to be made between "real time analytics" (the ability to run very quick ad-hoc queries over massive data) and the constant collection of "real time data" (which key-value datastores are better suited to support). BigQuery's current ingestion mechanism does not support constant single-record updates, but it does allow developers to run queries very quickly over massive datasets.
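To illustrate the query side, here is a minimal sketch of an ad-hoc aggregate with the google-cloud-bigquery Python client (a much later client than the API described above), run against one of Google's public sample tables:

```python
# Minimal ad-hoc aggregate query sketch against a public sample table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT corpus, COUNT(*) AS n
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY n DESC
    LIMIT 5
"""

# The query runs fully managed on BigQuery's side; only results come back.
for row in client.query(query).result():
    print(row.corpus, row.n)
```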
While Stack Overflow is not the most appropriate place for benchmark discussion, I would add that it's difficult to compare ingestion and query speeds without knowing the schema of the data you used. While 10 GB is a large amount of data, it would be interesting to see a comparison of ingestion and query speed over much larger datasets, say 10 terabytes or more.
Upvotes: 3