Reputation: 33
we currently have a DBT instance that sits over our Google BigQuery data warehouse. Now we've recently been asked to incorporate some data from Google Sheets into our modelling.
With that, is it possible for DBT to connect directly with Google Sheets? i.e. configure Google Sheets as a direct external datasource in the .yml file, or have DBT possibly run some sort of BigQuery federated SQL statement?
There's a DBT package called dbt-external-tables (https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/), but that only seems to work with BigQuery + files in Google Cloud Storage buckets.
But the common and most straightforward option I'm seeing in forums and documentation is to create an external table on BigQuery on top of the Google Sheet. And then have DBT connect to the external BigQuery table.
Just wanted to check if the above common option for integrating DBT x Google Sheets x BigQuery is in fact the only option, or if there's actually a way to have DBT connect directly to Google Sheets before hitting BigQuery?
Thanks
Upvotes: 3
Views: 5084
Reputation: 41
You can set up a Google Sheet as a source with dbt_external_tables
if you use BigQuery (edit 2023-08-10, based on feedback from a Databricks user). Just found this out last week. Here's an example with schema specified. If you leave out the columns it's supposed to be able to autodetect, but it didn't work for me on the sheet I did.
2023-04-21 updating with better config example and commands.
Here is a config showing both a configured schema and autodetect, both of which worked for me.
version: 2
sources:
- name: google_sheets
description: >
This source contains data stored in Google Sheets that are manually maintained.
project: my-project
dataset: stage_google_sheet
loader: external_table
tags:
- source
- google_sheet
tables:
- name: marketing_taxonomy
description: >
Marketing source channels, categories, etc.
external:
options:
format: GOOGLE_SHEETS
uris: ['https://docs.google.com/spreadsheets/d/[SHEET-ID]']
sheet_range: data
skip_leading_rows: 1
columns:
- name: source_raw
data_type: STRING
- name: source
data_type: STRING
- name: channel
data_type: STRING
- name: source_type
data_type: STRING
- name: source_category
data_type: STRING
- name: new_channel
data_type: STRING
- name: kpi_metrics_targets
description: >
KPI's maintained by FP&A
external:
options:
format: GOOGLE_SHEETS
uris: ['https://docs.google.com/spreadsheets/d/[SHEET-ID]']
sheet_range: 'Monthly Targets'
skip_leading_rows: 1
For people new to dbt_external_tables
, you have to run the following command before you can reference the source:
dbt run-operation stage_external_sources --args "select: google_sheets"
Update with your source name. In my case I had to add the --vars "ext_full_refresh: true"
option as well. That may not be necessary in all cases.
Edit: this works in BigQuery because BigQuery supports Google Sheets for its create external table
statement. See: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
Upvotes: 3
Reputation: 2763
From what I see over on the dbt-external-tables
side, the bigquery adapter folds to a DDL statement for the create_external_table macro.
Unfortunately, I just don't see a similar DDL statement available for the Google Sheets "external" definition. It looks like the UI probably executes something through the bq
cli client to create through from the web portal, if I had to guess.
If a section is ever added to this guide which includes a DDL definition for Google Drive based external sources, this would probably become a relatively easy build into the previously mentioned dbt macro for external tables. Until then, you will have to define this through the UI, the bq
client yourself, or the REST api.
Upvotes: 4