Reputation: 2164
This question concerns what an appropriate architecture on Google Cloud Platform would be for my particular use case.
I have a bunch of .yaml files that I would like to run SQL queries on using Google Cloud Platform's products. The combined size of these files would not be more than 30MB, and each file would be on average about 50KB. New files would also not be added very frequently - about 2-3 times a year.
I was thinking I could design an architecture where all these files are saved on Cloud Storage, a Dataflow pipeline or Cloud Function converts the .yaml files to .json, and the results are then imported into BigQuery to run SQL queries.
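To make the conversion step concrete, this is roughly what I imagine the pre-processing doing (a rough local sketch, not yet tied to Dataflow or Cloud Functions; the directory and file names are just placeholders):
import json
import glob
import yaml  # PyYAML

def yaml_to_ndjson(yaml_path, ndjson_path):
    # A .yaml file may contain several documents; keep them all.
    with open(yaml_path) as f:
        docs = list(yaml.safe_load_all(f))
    # BigQuery's JSON loader expects newline-delimited JSON: one object per line.
    with open(ndjson_path, "w") as out:
        for doc in docs:
            out.write(json.dumps(doc) + "\n")

for path in glob.glob("yaml_files/*.yaml"):
    yaml_to_ndjson(path, path.replace(".yaml", ".json"))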
What would be an appropriate approach? Using Dataflow or Cloud Functions for the pre-processing, or something else entirely?
I am also comfortable with Python, so I would prefer a solution that incorporates it; for example, Dataflow has a Python SDK.
Upvotes: 0
Views: 2758
Reputation: 14781
BigQuery probably isn't the right tool for this. A VM is also a bit of work and will be more expensive, and you'll need to maintain it.
Here's an approach using Cloud Functions. I'm going to assume you don't have to use SQL and can simply load the files' contents into memory and do basic string searching. The code is a little crude and cobbled together from other answers on SO, but it should be enough to get you going.
Here's the code:
index.js:
// Cloud Storage client (v1.x of @google-cloud/storage exports a factory function)
const storage = require('@google-cloud/storage')();

// Background function triggered whenever an object changes in the bucket
exports.searchYAML = function searchYAML(event) {
  return new Promise(function(resolve, reject) {
    const file = event.data;
    storage
      .bucket(file.bucket)
      .file(file.name)
      .download()
      .then(function(data) {
        // download() resolves with an array whose first element is a Buffer
        return data[0].toString('utf-8');
      })
      .then(function(contents) {
        console.log('New file ' + file.name);
        console.log(contents);
        // Do some searching/logic with the file contents here
        resolve(contents);
      })
      .catch(function(e) {
        reject(e);
      });
  });
};
package.json:
{
  "main": "index.js",
  "dependencies": {
    "@google-cloud/storage": "^1.2.1"
  }
}
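To wire this up, the function would be deployed with a Cloud Storage trigger on the bucket holding the YAML files (roughly gcloud functions deploy searchYAML --trigger-bucket <your-bucket>; the bucket name is a placeholder), so it runs whenever a file is added or updated.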
Upvotes: 2
Reputation: 81346
None of your proposed ideas are a good fit.
It will take longer to launch Cloud Dataflow than the actual processing time (10 minutes to launch, 1 second to process). You are trying to use a Mack truck to deliver a toothpick.
30 MB of YAML files is tiny. By the time you have written a Dataflow Python script, you could have already converted your YAML files to JSON.
YAML converted to JSON is not a good fit for BigQuery. BigQuery is column-based and built for structured data; converting and flattening JSON can be problematic. This is a task for a simple in-memory NoSQL query engine.
This is a very small task that will easily fit on the smallest Compute Engine VM instance running a Python script. App Engine would be another good choice.
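For a sense of scale, here is a minimal sketch of what that script could look like, using the google-cloud-storage Python client to pull everything into memory (the bucket name, the "status" field and the example filter are placeholders, and download_as_text assumes a reasonably recent client version):
import yaml                      # PyYAML
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-yaml-bucket")   # placeholder bucket name

records = []
for blob in bucket.list_blobs():
    if not blob.name.endswith(".yaml"):
        continue
    # ~30 MB in total easily fits in memory, so just download everything.
    for doc in yaml.safe_load_all(blob.download_as_text()):
        records.append(doc)

# "SQL-like" filtering is then a plain list comprehension
# (the "status" field is only an example).
matches = [r for r in records if isinstance(r, dict) and r.get("status") == "active"]
print(len(matches), "matching documents")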
Upvotes: 1