Reputation: 590
Is there a way to directly use data on Google BigQuery as the train/test data for ML algorithms supported by ML Engine?
All I saw in the documentation was using data stored on Google Cloud Storage.
Upvotes: 0
Views: 319
Reputation: 355
I think Torry provided a good answer. To add to that, please take a look at the baby weight example: besides the setup.py file, you should also modify model.py as described there.
To make a long story short:
# Build the train/eval SQL, pull each split into a pandas
# DataFrame, and wrap the training frame in an input function.
train_query, eval_query = create_queries()
train_df = query_to_dataframe(train_query)
eval_df = query_to_dataframe(eval_query)
train_x = input_fn(train_df)
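For context, here is a minimal sketch of what create_queries() could look like, following the hash-based split the baby weight example uses (the columns and the 75/25 split below are illustrative, not the exact code from the repo):

def create_queries():
    # Base query; the columns and hash column are illustrative.
    query_all = """
    SELECT
      weight_pounds,
      mother_age,
      gestation_weeks,
      ABS(FARM_FINGERPRINT(CAST(year AS STRING))) AS hashvalue
    FROM
      publicdata.samples.natality
    WHERE
      year > 2000
    """
    # Deterministic ~75/25 split on the hash, so the train and
    # eval sets never overlap across runs.
    train_query = "SELECT * FROM ({}) WHERE MOD(hashvalue, 4) < 3".format(query_all)
    eval_query = "SELECT * FROM ({}) WHERE MOD(hashvalue, 4) = 3".format(query_all)
    return train_query, eval_query

Splitting on a hash rather than RAND() keeps the train/eval assignment reproducible from one run to the next.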
Upvotes: 1
Reputation: 375
Yes, this is possible. Please refer to this blog post for a detailed guide; the full code is in the GitHub repo here, but the key takeaways follow.
setup.py
from setuptools import setup

setup(name='trainer',
      version='1.0',
      description='Showing how to use private key',
      url='http://github.com/GoogleCloudPlatform/training-data-analyst',
      author='Google',
      author_email='[email protected]',
      license='Apache2',
      packages=['trainer'],
      package_data={'': ['privatekey.json']},
      install_requires=[
          'pandas-gbq==0.4.1',
          'urllib3',
          'google-cloud-bigquery'
      ],
      zip_safe=False)
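Two details in this setup.py do the heavy lifting: package_data ships privatekey.json inside the trainer package so the key is present on the ML Engine workers, and zip_safe=False keeps the installed package as plain files so that pkgutil.get_data (used in pkg_query.py below) can reliably read the key back out.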
pkg_query.py
def query_to_dataframe(query):
    import pandas as pd
    import pkgutil
    # Read the service-account key that setup.py bundled into the
    # trainer package, and use it to authenticate pandas-gbq.
    privatekey = pkgutil.get_data('trainer', 'privatekey.json')
    print(privatekey[:200])
    return pd.read_gbq(query,
                       project_id='cloud-training-demos',
                       dialect='standard',
                       private_key=privatekey)

query = """
SELECT
  year,
  COUNT(1) as num_babies
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  year
"""
df = query_to_dataframe(query)
print(df.head())
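To tie this back to the question: once query_to_dataframe() returns a DataFrame, you can feed it to any estimator running on ML Engine. A minimal sketch, assuming the TensorFlow 1.x estimator API; the toy regression of num_babies on year is purely illustrative:

import tensorflow as tf

# Wrap the DataFrame from query_to_dataframe() in an input function.
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=df[['year']].astype('float32'),
    y=df['num_babies'].astype('float32'),
    batch_size=32,
    num_epochs=None,  # cycle through the data while training
    shuffle=True)

estimator = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column('year')])
estimator.train(input_fn=train_input_fn, max_steps=100)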
Upvotes: 2