Reputation: 2028
As part of our data engineering testing efforts we would like to check the schema of our views and tables against our schema files (JSON) before we run them so as to be able to capture breakage in typing during integration tests before running the pipeline.
Our current implementation checks for the typing after having created the table, which could result in issues if a table is created with the wrong schema:
sql = (
f"CREATE OR REPLACE VIEW `{project}`.{dataset_key}.{table_name} AS (\n" +
templating.render(path, render_context) + "\n" +
")"
)
BigQueryOperator(
task_id=f"view_{table_name}",
sql=sql,
use_legacy_sql=False,
# maximum_billing_tier=10,
).execute(context=context)
# asserting schema
if schema_path is not None:
bq_hook = BigQueryHook(location="northamerica-northeast1", use_legacy_sql=False)
with open(schema_path, 'r') as f:
schema = json.load(f)
fields = bq_hook.get_records(
f"""
SELECT
column_name AS name,
type,
CASE WHEN is_nullable = 'NO' THEN 'REQUIRED' ELSE 'NULLABLE' END AS mode,
FROM
`{project}.clinic_data_{clinic_id}`.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = '{table_name}'
ORDER BY ordinal_position ASC
""")
LOG.info('fields: %s', fields)
LOG.info('schema: %s', schema)
LOG.info('fields == schema: %s', fields == schema)
assert fields == schema, "Expected created view to match schema"
Where the schema is a list of dictionaries in the format expected by BQ:
[
{
"name": "date_key",
"type": "DATE",
"mode": "REQUIRED"
},
{
"name": "year",
"type": "INT64",
"mode": "REQUIRED"
},
# ...
]
We would like to transition to an approach where we can run the table SELECT in dry_run mode and obtain the resulting schema of the SELECT query so as to reduce chances of breakage at runtime and reduce our development feedback loop. Hence the question: Is there a way to get the schema of a SELECT statement without running it (dry_run)?
We could not find a way to do this using the QueryJob parameters returned by a QueryJobConfig with dry_run=True, but we may have overlooked some things :)
Any answer would be much appreciated!
Upvotes: 2
Views: 195
Reputation: 172944
Is there a way to get the schema of a SELECT statement without running it
Try below approach
WHERE FALSE
to the very end of your SELECT statement and run itUpvotes: 3