Hajar Homayouni

Reputation: 590

Use ML Engine directly with data stored in Google BigQuery tables

Is there a way to directly use data stored in Google BigQuery tables as train/test data for the ML algorithms supported by ML Engine?

What I saw in the documentation is to use data stored on Google Cloud Storage.

Upvotes: 0

Views: 319

Answers (2)

Shahin Vakilinia

Reputation: 355

I think Torry provided a good answer. To add to that, please take a look at the baby weight example. Besides the setup.py file, you should also modify model.py as described there.

To make a long story short:

# Build the training and evaluation SQL queries against BigQuery
train_query, eval_query = create_queries()
# Run each query and load the results into pandas DataFrames
train_df = query_to_dataframe(train_query)
eval_df = query_to_dataframe(eval_query)
# Feed the training DataFrame to the model's input function
train_x = input_fn(train_df)

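For context, here is a minimal sketch of what create_queries and input_fn might look like, assuming TensorFlow 1.x estimators and the general pattern of the baby weight example; the column names and the FARM_FINGERPRINT split below are illustrative, not the exact code from the repo, and query_to_dataframe is shown in Torry's answer below.

import tensorflow as tf

def create_queries():
  # Illustrative train/eval split on a fingerprint of the year column.
  base = """
  SELECT weight_pounds, mother_age, plurality, gestation_weeks, year
  FROM publicdata.samples.natality
  WHERE year > 2000
  """
  train_query = base + " AND MOD(ABS(FARM_FINGERPRINT(CAST(year AS STRING))), 5) < 4"
  eval_query = base + " AND MOD(ABS(FARM_FINGERPRINT(CAST(year AS STRING))), 5) = 4"
  return train_query, eval_query

def input_fn(df):
  # Wrap the pandas DataFrame in a TF 1.x estimator input function.
  return tf.estimator.inputs.pandas_input_fn(
      x=df.drop('weight_pounds', axis=1),
      y=df['weight_pounds'],
      batch_size=128,
      num_epochs=None,
      shuffle=True)
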
Upvotes: 1

Torry Yang

Reputation: 375

Yes, this is possible. Please refer to this blog post for a detailed guide. The GitHub repo is here, but these are the key takeaways.

setup.py

from setuptools import setup

setup(name='trainer',
      version='1.0',
      description='Showing how to use private key',
      url='http://github.com/GoogleCloudPlatform/training-data-analyst',
      author='Google',
      author_email='[email protected]',
      license='Apache2',
      packages=['trainer'],
      # Bundle the service-account key with the trainer package so it is
      # available to the training job on the ML Engine workers.
      package_data={'': ['privatekey.json']},
      install_requires=[
          'pandas-gbq==0.4.1',
          'urllib3',
          'google-cloud-bigquery'
      ],
      zip_safe=False)

pkg_query.py

def query_to_dataframe(query):
  import pandas as pd
  import pkgutil
  # Read the service-account key that was bundled with the trainer package.
  privatekey = pkgutil.get_data('trainer', 'privatekey.json')
  print(privatekey[:200])  # sanity check that the key was packaged
  # Run the query with pandas-gbq, authenticating with the private key.
  return pd.read_gbq(query,
                     project_id='cloud-training-demos',
                     dialect='standard',
                     private_key=privatekey)

query = """
SELECT
  year,
  COUNT(1) as num_babies
FROM
  publicdata.samples.natality
WHERE
  year > 2000
GROUP BY
  year
"""

df = query_to_dataframe(query)
print(df.head())
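
Once the query results are in a DataFrame, you can feed them into an estimator as usual. Below is a minimal, hypothetical example assuming TF 1.x; the regressor and feature column are for illustration only and are not part of the blog post.

import tensorflow as tf

# Hypothetical: train a simple regressor on the queried data.
feature_columns = [tf.feature_column.numeric_column('year')]
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=df[['year']],
    y=df['num_babies'],
    num_epochs=10,
    shuffle=True)

estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)
estimator.train(input_fn=train_input_fn)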

Upvotes: 2
