while

Reputation: 3772

Creating new BigQuery datasets in Dataflow

How can I create a new BigQuery dataset in Dataflow to save my data in?

I would like the dataset name to be versioned with the version tag from the dataflow program.

I am using the Python API and tried to use the BigQuery client to do this outside of the beam.io.BigQuerySink, but then I get the following error when running the flow on GCP: ImportError: No module named cloud, which refers to the import from google.cloud import bigquery.

What would be the best way to do this?

Upvotes: 0

Views: 511

Answers (1)

Graham Polley

Reputation: 14781

You are on the right track with using the BigQuery client outside your sink. It should look something like this:

[..]
from google.cloud import bigquery

# Create a client bound to your project
client = bigquery.Client(project='PROJECT_ID')

# Build a Dataset object and create it in BigQuery
dataset = client.dataset(DATASET_NAME)
dataset.create()
[..]
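As a side note, client.dataset(...).create() is how older releases of the google-cloud-bigquery library expose this; newer releases create datasets through Client.create_dataset instead. Below is a minimal sketch against that newer API, with the project ID, dataset name and version tag as placeholders, showing one way to bake a version tag into the dataset name as the question asks:

from google.cloud import bigquery

# Placeholders: swap in your real project ID and version tag.
# BigQuery dataset IDs may only contain letters, numbers and underscores.
version_tag = '1_2_0'

client = bigquery.Client(project='PROJECT_ID')
dataset = bigquery.Dataset('{}.my_dataset_v{}'.format(client.project, version_tag))

# exists_ok=True (available in recent client versions) makes reruns idempotent
client.create_dataset(dataset, exists_ok=True)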

You have to remember that, although this may work when you run your pipeline locally, the VMs that are spun up in the worker pool when you run it remotely on GCP will not have the same dependencies as your local machine.

So, you need to install the dependencies remotely by following these steps:

  1. Find out which packages are installed on your machine by running pip freeze > requirements.txt. This creates a requirements.txt file listing every package installed on your machine, regardless of where it was installed from.
  2. In the requirements.txt file, leave only the packages that were installed from PyPI and are used in the workflow source. Delete the rest of the packages that are irrelevant to your code.
  3. Run your pipeline with the command-line option --requirements_file requirements.txt. This will stage the requirements.txt file to the staging location you defined (see the sketch below).
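If you prefer to set this in code rather than on the command line, the same option can be passed through PipelineOptions; a minimal sketch, assuming the Beam Python SDK, with the project ID and bucket paths as placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# PROJECT_ID and the gs:// paths are placeholders; the key part is
# --requirements_file, which stages the file so workers can install from it.
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=PROJECT_ID',
    '--temp_location=gs://YOUR_BUCKET/tmp',
    '--staging_location=gs://YOUR_BUCKET/staging',
    '--requirements_file=requirements.txt',
])

with beam.Pipeline(options=options) as p:
    pass  # your transforms go here

Either way, Dataflow uses the staged requirements file to install the same packages on each worker before your code runs.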

Upvotes: 2
