Reputation: 3772
How can I create a new BigQuery dataset in Dataflow to save my data in?
I would like the dataset name to be versioned with the version tag of the Dataflow program.
I am using the Python API and tried to use the BigQuery client outside of the beam.io.BigQuerySink to do this, but then I get the following error when running the flow on GCP:
ImportError: No module named cloud
which refers to the bigquery import, from google.cloud import bigquery.
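Roughly, this is what I am trying (the project ID, dataset name and version tag below are just placeholders):

from google.cloud import bigquery  # this import fails on GCP with: ImportError: No module named cloud

version_tag = '1.0.0'  # the version tag of my Dataflow program (placeholder)
client = bigquery.Client(project='my-project')  # placeholder project ID
# dataset IDs only allow letters, digits and underscores, so the tag is sanitised
dataset = client.dataset('my_dataset_v' + version_tag.replace('.', '_'))
dataset.create()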
What would be the best way to do this?
Upvotes: 0
Views: 511
Reputation: 14781
You are on the right track with using the BigQuery client outside your sink. It should look something like this:
[..]
from google.cloud import bigquery
client = bigquery.Client(project='PROJECT_ID')
dataset = client.dataset(DATASET_NAME)
dataset.create()
[..]
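To tie this back to your versioning requirement, a minimal sketch could look like the following (the project ID, table name and version tag are placeholders, and it uses the same client API as above; newer releases of google-cloud-bigquery create datasets with client.create_dataset() instead of dataset.create()):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from google.cloud import bigquery

PROJECT_ID = 'PROJECT_ID'   # placeholder
VERSION = '1.2.0'           # the version tag of your Dataflow program
# Dataset IDs may only contain letters, digits and underscores, so sanitise the tag.
DATASET_NAME = 'my_dataset_v' + VERSION.replace('.', '_')

# Create the versioned dataset before the pipeline starts writing to it.
client = bigquery.Client(project=PROJECT_ID)
dataset = client.dataset(DATASET_NAME)
dataset.create()

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'MakeRows' >> beam.Create([{'id': 1}, {'id': 2}])
     | 'WriteToBQ' >> beam.io.Write(beam.io.BigQuerySink(
           '%s:%s.my_table' % (PROJECT_ID, DATASET_NAME),
           schema='id:INTEGER',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))

This still imports google.cloud.bigquery in the main program, so the package has to be available on the Dataflow workers as well, which brings us to the error you are seeing.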
You have to remember that, although this may work when you run your pipeline locally, the VMs that are spun up in the worker pool when you run it remotely on GCP will not have the same dependencies as your local machine.
So, you need to install the dependencies on the remote workers by following these steps:

1. Run pip freeze > requirements.txt. This will create a requirements.txt file that lists all packages installed on your machine, regardless of where they were installed from.
2. Run your pipeline with the --requirements_file requirements.txt option. This will stage the requirements.txt file to the staging location you defined, so the workers can install those packages.

Upvotes: 2