Reputation: 525
I'm using Datalab on Google Cloud Platform and was trying to create a BigQuery dataset with google.datalab.bigquery when I found I needed the Client class, which exists only in the google.cloud.bigquery library.
What's the difference between the datalab and cloud versions of the bigquery library?
Is the datalab one a slimmed down version of the cloud library, or do they have different intended uses?
Upvotes: 3
Views: 754
Reputation: 4166
google.cloud.bigquery is the Python client library for BigQuery. It provides access to all the functionality of the BigQuery REST API and is similar to the client libraries for Java, Go, C++ and other languages. It is essentially the idiomatic Python wrapper for things you can do with the bq service.
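As a minimal sketch of the dataset creation the question asks about, the client library can be used roughly as follows. The helper names `create_dataset` and `default_client` are hypothetical, and the sketch assumes google-cloud-bigquery is installed and default credentials are configured.

```python
def create_dataset(client, dataset_name):
    """Create a dataset in the client's default project (no error if it exists)."""
    # Client.create_dataset accepts a "project.dataset" string as well as a Dataset object.
    dataset_id = f"{client.project}.{dataset_name}"
    return client.create_dataset(dataset_id, exists_ok=True)


def default_client():
    """Build a client from the environment's default project and credentials."""
    # Imported lazily so create_dataset can be exercised without the library installed.
    from google.cloud import bigquery

    return bigquery.Client()
```

Calling `create_dataset(default_client(), "my_dataset")` would then create `my_dataset` in the default project; the dataset name is only a placeholder.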
google.datalab.bigquery is a Python library meant for use within notebooks by data scientists. For example, it has a method to take a BigQuery result set and convert it into a pandas DataFrame. Datalab also comes with mltoolbox, which simplifies training and evaluation of machine learning models. There is no Java or Go equivalent. It uses the client library to actually talk to BigQuery.
Update (July 2019): google.cloud.bigquery has now been updated to include many of the nice things the datalab package used to provide, including pandas interoperability. At this point, google.cloud.bigquery should be considered the preferred way to do things, even in notebooks. For example, the %%bigquery magic comes as part of google.cloud.bigquery. Instead of using mltoolbox in Datalab, use BigQuery ML to train ML models directly in BigQuery.
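The pandas interoperability mentioned in the update can be sketched like this. `query_to_dataframe` is a hypothetical helper name; `client` is assumed to be a `google.cloud.bigquery.Client`, and pandas is assumed to be installed alongside the client library.

```python
def query_to_dataframe(client, sql):
    """Run a query and return the result set as a pandas DataFrame."""
    # client.query starts a query job; to_dataframe waits for it and downloads the rows.
    query_job = client.query(sql)
    return query_job.to_dataframe()
```

In a notebook, the %%bigquery cell magic covers the same ground without explicit client code.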
Upvotes: 3
Reputation: 1696
Disclaimer: This is not an overview of intended uses or deep differences, but of superficial differences between these packages.
One (not very satisfying) answer could be to analyze usage, inferred from installation counts.
Row  project                num_downloads
1    google-cloud-bigquery  619666
2    datalab                5313
I got these numbers with a BigQuery query (as described here):
#standardSQL
SELECT
  file.project,
  COUNT(*) AS num_downloads
FROM
  `the-psf.pypi.downloads*`
WHERE
  file.project IN ('google-cloud-bigquery', 'datalab')
  -- Only query the last 60 days of history
  AND _TABLE_SUFFIX BETWEEN
    FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 60 DAY))
    AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())
GROUP BY
  file.project
ORDER BY
  num_downloads DESC
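To make the `_TABLE_SUFFIX` window in the query concrete, here is a small sketch that computes the same two bounds in Python. The function name `table_suffix_window` is hypothetical; the logic only mirrors the `FORMAT_DATE` / `DATE_SUB` expressions above.

```python
from datetime import date, timedelta


def table_suffix_window(today, days=60):
    """Return (start, end) table suffixes in YYYYMMDD form for the last `days` days."""
    start = today - timedelta(days=days)
    # Same format string the query passes to FORMAT_DATE.
    return start.strftime("%Y%m%d"), today.strftime("%Y%m%d")
```

Because the suffixes are zero-padded YYYYMMDD strings, comparing them as strings in `BETWEEN` gives the same ordering as comparing the dates themselves.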
So you can see that google-cloud-bigquery is "more popular" (perhaps because it gets installed automatically with google-cloud-python?).
If you want to go into more detail, have a look at the code (github: google.cloud.bigquery vs github: google.datalab.bigquery); you will see that the two packages differ a lot.
Further investigation of the Insights pages on GitHub (cloud vs pydatalab) shows more differences:
cloud.bigquery has existed longer (since January 2014 compared to May 2016, assuming each package is as old as its repo). pydatalab is developed by different contributors than the cloud.bigquery package. And, lastly, cloud.bigquery shows somewhat more activity (possibly because its repo also includes other packages).
So, even if this is maybe not what you wanted or expected as an answer, I can say from a first look at the code and the documentation (compare cloud vs pydatalab) that pydatalab seems slightly more comfortable to use, even though it seems to be less actively developed. So the answer is YES, they seem to be for different purposes.
Upvotes: 2