Reputation: 1558
I'm wondering what best practices and tools people have found for building and managing ETL jobs on BigQuery.
At the moment I have lots of SQL 'templates' (horribly parameterized by lob, date, etc. using sed-style string replacements into a tmp.sql file, which I then run), and I use the bq command-line tool to run sequences of them and send the output to tables. It works fine but is getting a bit unwieldy. I still don't get why I can't run stored-procedure-style parameterized scripts on BigQuery, or even use some sort of GUI to build and manage pipelines.
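For context, a single run currently looks roughly like this (the placeholder, file and table names here are made up for illustration):

```sh
# template.sql contains placeholders like __LOB__ and __RUN_DATE__ (illustrative names)
sed -e "s/__LOB__/retail/g" \
    -e "s/__RUN_DATE__/2017-06-01/g" \
    template.sql > tmp.sql

# run the rendered SQL and write the output to a destination table
bq query --use_legacy_sql=false \
    --destination_table=my_dataset.daily_summary \
    --replace \
    "$(cat tmp.sql)"
```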
I love BigQuery, but I really feel like I'm either missing something very obvious here or it's a real gap in the product (e.g. I'm pretty sure Apache Drill is more built out in this regard).
So I'm just wondering if anyone can share best-practice ETL tips or approaches you use yourselves.
I do also use Xplenty for some jobs, which is good, but it's also a bit messy in that I can't just write SQL in it, so it can be painful to build and debug complicated pipelines.
I was also thinking about looking into Talend, but really, parameterized stored procedures, macros and SQL are all I'd ideally need.
Sorry if this is more of a discussion question than a specific code question. Happy to move it to Reddit or somewhere similar if it's better suited there.
Upvotes: 2
Views: 1490
Reputation: 1167
Nowadays a dbt project is the answer: it gives you templated, parameterized SQL models that dbt compiles and runs against BigQuery, which covers the kind of templating and pipeline sequencing you're doing by hand.
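As a rough sketch (the model, source and variable names below are made up, and the source is assumed to be declared in a sources .yml file), one of your sed-style templates becomes a model, with the date as a dbt variable:

```sql
-- models/daily_summary.sql (hypothetical model name)
{{ config(materialized='table') }}

select
    lob,
    count(*) as events
from {{ source('analytics', 'events') }}
where event_date = '{{ var("run_date") }}'
group by lob
```

Then something like `dbt run --vars '{"run_date": "2020-01-01"}'` renders the Jinja, runs the query on BigQuery and materializes the result as a table, so there's no tmp.sql or sed step to maintain.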
Upvotes: 0
Reputation: 432
Google Cloud Dataflow is closer to your needs than BigQuery alone, in my opinion. We use it for real-time streaming ETL with automatic scaling. It works great, though you will need to write Java code.
Upvotes: 1