jfk83

Reputation: 810

How can I export data from BigQuery to an external server as a CSV?

I need to automate a process to extract data from Google BigQuery and export it to a CSV on an external server outside of GCP.

While researching how to do that, I found some commands to run from my external server, but I would prefer to do everything in GCP to avoid possible problems.

To export the table to a CSV in Google Cloud Storage:

bq --location=US extract --compression GZIP 'dataset.table' gs://example-bucket/myfile.csv

To download the CSV from Google Cloud Storage:

gsutil cp gs://[BUCKET_NAME]/[OBJECT_NAME] [OBJECT_DESTINATION]

But I would like to hear your suggestions.

Upvotes: 6

Views: 5073

Answers (1)

Joan Grau Noël

Reputation: 3192

If you want to fully automate this process, I would do the following:

  1. Create a Cloud Function to handle the export:

This is the most lightweight solution, as Cloud Functions are serverless and provide the flexibility to implement code with the Client Libraries. See the quickstart; I recommend using the console to create the function to start with.

In this example I recommend triggering the Cloud Function from an HTTP request, i.e. when the function URL is called, it will run the code inside of it.

An example Cloud Function in Python that starts the export when an HTTP request is made:

main.py

from google.cloud import bigquery

def hello_world(request):
    project_name = "MY_PROJECT"
    bucket_name = "MY_BUCKET"
    dataset_name = "MY_DATASET"
    table_name = "MY_TABLE"
    destination_uri = "gs://{}/{}".format(bucket_name, "bq_export.csv.gz")

    bq_client = bigquery.Client(project=project_name)

    dataset = bq_client.dataset(dataset_name, project=project_name)
    table_to_export = dataset.table(table_name)

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        table_to_export,
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )  
    return "Job with ID {} started exporting data from {}.{} to {}".format(extract_job.job_id, dataset_name, table_name, destination_uri)

requirements.txt

google-cloud-bigquery

Note that the job will run asynchronously in the background. You will receive a response with the job ID, which you can use to check the state of the export job from Cloud Shell by running:

bq show -j <job_id>

  2. Create a Cloud Scheduler scheduled job:

Follow this documentation to get started. You can set the Frequency with the standard cron format, for example 0 0 * * * will run the job every day at midnight.

As a target, choose HTTP, put the Cloud Function HTTP URL in the URL field (you can find it in the Console, inside the Cloud Function details, under the Trigger tab), and choose GET as the HTTP method.

Create it, and you can test it in the Cloud Scheduler by pressing the Run now button in the Console.
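If you prefer the command line over the Console, a roughly equivalent gcloud command would be the following; the job name and the function URL are placeholders you would replace with your own:

gcloud scheduler jobs create http bq-export-trigger \
    --schedule="0 0 * * *" \
    --uri="https://REGION-PROJECT_ID.cloudfunctions.net/hello_world" \
    --http-method=GET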

  3. Synchronize your external server and the bucket:

Up until now you only have exports scheduled to run every 24 hours. To synchronize the bucket contents with your external server, you can use the gsutil rsync command. If you want to save the exports, let's say to the my_exports folder, you can run, on your external server:

gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports

To run this command periodically on your server, you could create a standard cron job in the crontab of your external server, scheduled each day as well, just a few hours after the BigQuery export, to ensure that the export has finished.
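For example, assuming the export runs at midnight, a crontab entry like the one below would pull the new files every day at 3 AM (the bucket name and local path are placeholders; use the full path to gsutil if it is not on cron's PATH):

0 3 * * * gsutil rsync gs://BUCKET_WITH_EXPORTS /local-path-to/my_exports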

Extra:

I have hard-coded most of the variables in the Cloud Function to always be the same. However, you can send parameters to the function if you make a POST request instead of a GET request, and send the parameters as data in the body.

You will have to change the Cloud Scheduler job to send a POST request to the Cloud Function HTTP URL, and in the same place you can set the body to send the parameters regarding the table, dataset and bucket, for example. This will allow you to run exports from different tables at different hours, and to different buckets.
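As a rough sketch of that idea, assuming the Scheduler job sends a JSON body with "dataset", "table" and "bucket" keys (these key names are an assumption, not fixed names), the function could read the parameters like this:

from google.cloud import bigquery

def hello_world(request):
    # Read parameters from the JSON body of the POST request;
    # fall back to defaults if a key is missing.
    params = request.get_json(silent=True) or {}
    dataset_name = params.get("dataset", "MY_DATASET")
    table_name = params.get("table", "MY_TABLE")
    bucket_name = params.get("bucket", "MY_BUCKET")
    destination_uri = "gs://{}/{}_export.csv.gz".format(bucket_name, table_name)

    bq_client = bigquery.Client()

    job_config = bigquery.job.ExtractJobConfig()
    job_config.compression = bigquery.Compression.GZIP

    extract_job = bq_client.extract_table(
        bq_client.dataset(dataset_name).table(table_name),
        destination_uri,
        # Location must match that of the source table.
        location="US",
        job_config=job_config,
    )
    return "Job with ID {} started exporting data from {}.{} to {}".format(
        extract_job.job_id, dataset_name, table_name, destination_uri)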

Upvotes: 6
