Jash Shah

Reputation: 2164

YAML files to Google BigQuery: what should my architecture on Google Cloud Platform be?

This question is about choosing an appropriate architecture on Google Cloud Platform for my particular use case.

I have a bunch of .yaml files that I would like to run SQL queries on using Google Cloud Platform's products. The combined size of these files would not be more than 30MB and each file would be on average about 50KB. New files would also not be added very frequently - about 2-3 times a year.

I was thinking of an architecture where all these files are stored in Cloud Storage, a Dataflow pipeline or a Cloud Function converts the .yaml files to .json, and the results are then loaded into BigQuery so I can run SQL queries on them.

What would be an appropriate approach: Dataflow or Cloud Functions for pre-processing, or something else entirely?

I am also comfortable with Python, so I would prefer a solution that incorporates it. For example, Dataflow has a Python SDK.
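For concreteness, the conversion step I have in mind would look roughly like this (a minimal sketch assuming PyYAML is installed and that each .yaml file holds a list of flat records; file names are placeholders):

import json
import yaml  # PyYAML

def yaml_to_ndjson(yaml_path, ndjson_path):
    # Convert one YAML file to newline-delimited JSON, the format BigQuery loads.
    with open(yaml_path) as f:
        records = yaml.safe_load(f)  # assumes the file contains a list of records
    with open(ndjson_path, 'w') as out:
        for record in records:
            out.write(json.dumps(record) + '\n')

# Hypothetical usage:
# yaml_to_ndjson('example.yaml', 'example.json')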

Upvotes: 0

Views: 2758

Answers (2)

Graham Polley

Reputation: 14781

BigQuery probably isn't the right tool for this. Using a VM is a bit of work and will be more expensive, and you'll also need to maintain that VM.

Here's an approach using Cloud Functions. I'm going to assume you don't have to use SQL and can simply load the files' contents into memory and do basic string searching. The code is a little crude and cobbled together from other answers on SO, but it should be enough to get you going.

  • Create a Cloud Function (the example uses Node, but it could be Python too; see the sketch after the code below) to listen to your bucket, and set the memory to an appropriate size, e.g. 256MB, because we'll be reading the contents into memory. Also, bump up the max execution time from the default of 60s.

  • Deploy it.

  • Upload a YAML file to the GCS bucket and it will trigger the Cloud Function to execute and then read the contents of the file into memory.

  • Examine the logs of the function that executed and you can see it read the contents of the YAML file into memory.

Here's the code:

index.js:

const storage = require('@google-cloud/storage')(); // v1.x API: the module exports a factory function

exports.searchYAML = function searchYAML(event) {
    const file = event.data;
    // Download the object that triggered the function and read its contents into memory.
    return storage
        .bucket(file.bucket)
        .file(file.name)
        .download()
        .then(function(data) {
            // download() resolves with an array; the first element is the file contents (a Buffer).
            const contents = data[0].toString('utf-8');
            console.log("New file " + file.name);
            console.log(contents);
            // Do some searching/logic with the file contents here
            return contents;
        });
};

package.json:

{
  "main": "index.js",
  "dependencies": {
    "@google-cloud/storage": "^1.2.1"
  }
}
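
Since you mentioned Python, a rough equivalent using the Python runtime for Cloud Functions might look like the sketch below (untested; it assumes the google-cloud-storage client library is listed in requirements.txt, and the function name is arbitrary):

main.py:

from google.cloud import storage

storage_client = storage.Client()

def search_yaml(event, context):
    # Triggered by a change to a Cloud Storage bucket; reads the new file into memory.
    bucket = storage_client.bucket(event['bucket'])
    blob = bucket.blob(event['name'])
    contents = blob.download_as_string().decode('utf-8')
    print('New file ' + event['name'])
    print(contents)
    # Do some searching/logic with the file contents here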

Upvotes: 2

John Hanley

Reputation: 81346

None of your proposed ideas are a good fit.

It will take longer to launch Cloud Dataflow than the actual processing time (10 minutes to launch, 1 second to process). You are trying to use a Mack truck to deliver a toothpick.

30 MB of YAML files is tiny. In the time it takes to write a Dataflow Python script, you could have already converted your YAML files into JSON.

YAML converted to JSON is not a good use of BigQuery. BigQuery is column-based and meant for structured data, and converting and flattening JSON can be problematic. This is a task for a simple in-memory NoSQL query engine.

This is a very small task that will easily fit on the smallest Compute Engine VM instance running a Python script. App Engine would be another good choice.
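
To give a sense of how small this is, here is a minimal in-memory sketch in Python. As a variation on the suggestion above, it uses SQLite's in-memory database (standard library) rather than a NoSQL engine so the original SQL requirement is kept; it assumes PyYAML is installed and that each YAML file contains a list of flat records sharing the same keys, and the paths and query are placeholders:

import glob
import sqlite3
import yaml  # PyYAML

# Load every YAML file into a single list of flat dict records.
rows = []
for path in glob.glob('data/*.yaml'):   # hypothetical location of the YAML files
    with open(path) as f:
        rows.extend(yaml.safe_load(f))

# Build an in-memory SQLite table from the record keys and run SQL against it.
columns = sorted(rows[0].keys())
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE records (%s)' % ', '.join(columns))
db.executemany(
    'INSERT INTO records VALUES (%s)' % ', '.join('?' for _ in columns),
    [tuple(row[c] for c in columns) for row in rows],
)

for result in db.execute('SELECT COUNT(*) FROM records'):   # example query
    print(result)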

Upvotes: 1
