Airflow - GoogleCloudStorageToBigQueryOperator does not render templated source_objects

Question

The documentation states that the source_objects argument accepts templated values. However, when I try the following:

gcs_to_bq_op = GoogleCloudStorageToBigQueryOperator(
    task_id=name,
    bucket='gdbm-public',
    source_objects=['entity/{{ ds_nodash }}.0.{}.json'.format(filename)],
    destination_project_dataset_table='dbm_public_entity.{}'.format(name),
    schema_fields=schema,
    source_format='NEWLINE_DELIMITED_JSON',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    max_bad_records=0,
    allow_jagged_rows=True,
    google_cloud_storage_conn_id='my_gcp_conn',
    bigquery_conn_id='my_gcp_conn',
    delegate_to=SERVICE_ACCOUNT,
    dag=dag
)

I receive the error message: Exception: BigQuery job failed. Final error was: {u'reason': u'notFound', u'message': u'Not found: URI gs://gdbm-public/entity/{ ds_nodash }.0.GeoLocation.json'}.

I found an example where the {{ ds_nodash }} variable is used in the same way, so I'm not sure why this doesn't work for me.

Dustin Ingram · Accepted Answer

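The issue is that str.format treats a doubled brace as an escape for a single literal brace, so calling .format on the string collapses each {{ and }} into { and }. The template engine then sees { ds_nodash }, which is not a Jinja expression, and leaves it unrendered (this is exactly the URI shown in your error message):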

>>> 'entity/{{ ds_nodash }}.0.{}.json'.format(filename)
'entity/{ ds_nodash }.0.foobar.json'

You need to escape the braces that you want to keep in the string by doubling them again, so that two pairs survive the .format call:

>>> 'entity/{{{{ ds_nodash }}}}.0.{}.json'.format(filename)
'entity/{{ ds_nodash }}.0.foobar.json'
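
As a rough sketch of the options, the snippet below compares the original call, the escaped version, and two alternatives that avoid str.format altogether (%-formatting and plain concatenation). The filename value is only a placeholder taken from the error message, and the alternatives are suggestions rather than anything Airflow requires. It uses plain Python only, so it shows what string Airflow would receive, not the rendered result.

```python
# Placeholder value, taken from the URI in the error message above.
filename = 'GeoLocation'

# Original call: str.format collapses {{ and }} to single braces.
collapsed = 'entity/{{ ds_nodash }}.0.{}.json'.format(filename)

# Fix from this answer: double the braces that must survive .format.
escaped = 'entity/{{{{ ds_nodash }}}}.0.{}.json'.format(filename)

# Alternative: %-formatting gives braces no special meaning.
percent = 'entity/{{ ds_nodash }}.0.%s.json' % filename

# Alternative: plain concatenation, no formatting step at all.
concat = 'entity/{{ ds_nodash }}.0.' + filename + '.json'

for label, value in [('collapsed', collapsed), ('escaped', escaped),
                     ('percent', percent), ('concat', concat)]:
    print(label, '->', value)
```

Only the first variant loses the Jinja delimiters; the other three all produce the same string, so any of them can be passed in source_objects.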
