Richard

Reputation: 65530

Tools for running analysis on data held in BigQuery?

I have about 100GB of data in BigQuery, and I'm fairly new to data analysis tools. I want to grab about 3000 extracts via a programmatic series of SQL queries, and then run some statistical analysis to compare kurtosis across those extracts.

Right now my workflow is as follows:

The second of these works fine, but it's pretty slow and painful to save all 3000 data extracts locally (network timeouts, etc.).

Is there a better way of doing this? Basically I'm wondering if there's some kind of cloud tool where I could quickly run the calls to get the 3000 extracts, then run the Python to do the kurtosis analysis.
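
For concreteness, here's a sketch of the kind of per-extract step I have in mind, using the google-cloud-bigquery client and scipy (the project, table, column, and segment names are just placeholders):

    # Sketch only: pull one extract per segment and compute its kurtosis locally.
    # Project, table, column, and segment names are placeholders.
    from google.cloud import bigquery
    from scipy.stats import kurtosis

    client = bigquery.Client(project="my-project")

    sql = """
        SELECT value
        FROM `my-project.my_dataset.my_table`
        WHERE segment = @segment
    """

    results = {}
    for segment in ["a", "b", "c"]:  # in reality ~3000 of these
        job_config = bigquery.QueryJobConfig(
            query_parameters=[bigquery.ScalarQueryParameter("segment", "STRING", segment)]
        )
        df = client.query(sql, job_config=job_config).result().to_dataframe()
        results[segment] = kurtosis(df["value"])  # Fisher (excess) kurtosis by default

The slow part is pulling each extract down over the network before scipy ever sees it.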

I had a look at https://cloud.google.com/bigquery/third-party-tools but I'm not sure if any of those do what I need.

Upvotes: 0

Views: 263

Answers (3)

myitn

Reputation: 11

You can check out Cooladata.

It allows you to query BQ tables as external data sources. What you can do is either schedule your queries and export the results to Google Cloud Storage, and pick up from there, or use the built-in reporting tool to answer your 3000 queries. It also provides all the BI tools you'll need for your business.
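
If you take the export-to-Storage route, a rough sketch of that step with the plain BigQuery Python client (not Cooladata's own tooling; the project, dataset, table, and bucket names are placeholders):

    # Sketch only: materialise a query result in a table, then export it to
    # Google Cloud Storage. All project/dataset/table/bucket names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    dest = bigquery.TableReference.from_string("my-project.my_dataset.extract_a")
    job_config = bigquery.QueryJobConfig(destination=dest)
    client.query(
        "SELECT value FROM `my-project.my_dataset.my_table` WHERE segment = 'a'",
        job_config=job_config,
    ).result()

    # Export the destination table to GCS as CSV and pick it up from there.
    client.extract_table(dest, "gs://my-bucket/extract_a.csv").result()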

Upvotes: 0

Mikhail Berlyant

Reputation: 172993

So far, Cloud Datalab is your best option:
https://cloud.google.com/datalab/
It is in beta, so some surprises are possible.
Datalab is built on top of the Jupyter/IPython option below and runs entirely in the cloud.

Another option is Jupyter/IPython Notebook
http://jupyter-notebook-beginner-guide.readthedocs.org/en/latest/

Our data science team started with the second option long ago with great success and is now moving toward Datalab.
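
A minimal sketch of that notebook workflow, assuming the pandas-gbq helper is installed (project, table, and column names are placeholders):

    # Inside a Jupyter/IPython (or Datalab) notebook: pull one batch of extracts
    # and compare kurtosis per segment in pandas. Names are placeholders.
    import pandas as pd

    sql = """
        SELECT segment, value
        FROM `my-project.my_dataset.my_table`
        WHERE extract_group = 'g1'   -- one slice at a time, not the full 100GB
    """
    df = pd.read_gbq(sql, project_id="my-project", dialect="standard")  # needs pandas-gbq

    # Excess kurtosis per segment, computed in the notebook.
    print(df.groupby("segment")["value"].apply(pd.Series.kurt))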

For the rest of the business (prod, BI, ops, sales, marketing, etc.), though, we had to build our own workflow/orchestration tool, as nothing we found was good or relevant enough.

Upvotes: 2

Zig Mandel

Reputation: 19835

Two easy ways:

1: If your issue is the network, like you say, use a Google Compute Engine machine to do the analysis, in the same zone as your BigQuery tables (US, EU, etc.). It will not have network issues getting data from BigQuery and will be super fast. The machine will only cost you for the minutes you use it. Save a snapshot of your machine to reuse its setup anytime (a snapshot also has a monthly cost, but much lower than keeping the machine up).

2: Use Google Cloud Datalab (beta as of Dec. 2015), which supports BigQuery sources and gives you all the tools you need to do the analysis and later share it with others: https://cloud.google.com/datalab/

from their docs: "Cloud Datalab is built on Jupyter (formerly IPython), which boasts a thriving ecosystem of modules and a robust knowledge base. Cloud Datalab enables analysis of your data on Google BigQuery, Google Compute Engine, and Google Cloud Storage using Python, SQL, and JavaScript (for BigQuery user-defined functions)."
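
On the network point in option 1, another hedged possibility is to push the kurtosis calculation into BigQuery itself, so only one summary row per extract ever leaves the service (a sketch assuming standard SQL; table and column names are placeholders):

    # Sketch only: compute excess kurtosis per segment inside BigQuery, so the
    # Python side only receives one row per segment. Names are placeholders.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")

    sql = """
        SELECT
          t.segment,
          AVG(POW((t.value - s.mu) / s.sigma, 4)) - 3 AS excess_kurtosis
        FROM `my-project.my_dataset.my_table` AS t
        JOIN (
          SELECT segment, AVG(value) AS mu, STDDEV_POP(value) AS sigma
          FROM `my-project.my_dataset.my_table`
          GROUP BY segment
        ) AS s
        ON t.segment = s.segment
        GROUP BY t.segment
    """
    for row in client.query(sql).result():
        print(row.segment, row.excess_kurtosis)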

Upvotes: 1
