Reputation: 338
What I want to do is read an existing table and generate a new table that has the same schema as the original plus a few extra columns (computed from some columns of the original table). The original table's schema may gain columns without notice to me (the fields I use in my Dataflow job won't change), so I would like to always read the schema at run time instead of defining a custom class that contains it.
In Dataflow SDK 1.x, I can get the TableSchema via
final DataflowPipelineOptions options = ...
final String projectId = ...
final String dataset = ...
final String table = ...
final TableSchema schema = new BigQueryServicesImpl()
    .getDatasetService(options)
    .getTable(projectId, dataset, table)
    .getSchema();
For Dataflow SDK 2.x, BigQueryServicesImpl has become a package-private class.
I read the responses to "Get TableSchema from BigQuery result PCollection<TableRow>", but I'd prefer not to make a separate query to BigQuery. As that response is now almost 2 years old, are there other thoughts or ideas from the SO community?
Upvotes: 3
Views: 768
Reputation: 1591
Because of how BigQueryIO is set up now, the table schema has to be queried before the pipeline begins to run. This is a good feature idea, but it's not feasible in a single pipeline. In the example you linked, the table schema is queried before running the pipeline.
If new columns are added, then unfortunately the pipeline must be relaunched so that it picks up the new schema.
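To illustrate the construction-time step, here is a minimal sketch of deriving the output schema once the original schema has been fetched before the pipeline starts. It uses plain Java only: `SchemaExtender`, `Field`, and `withExtraColumns` are hypothetical names, and `Field` is just a stand-in for whatever schema classes your job actually uses. The case-insensitive name check is an assumption on my part about how column names should be compared.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of any SDK: builds "original schema + extra columns".
final class SchemaExtender {

    // Stand-in for a real field definition; holds only a column name and a type name.
    static final class Field {
        final String name;
        final String type;

        Field(String name, String type) {
            this.name = name;
            this.type = type;
        }
    }

    // Returns a new list: every original field in order, followed by the extras.
    // Throws if an extra column's name is already taken (compared case-insensitively,
    // which is an assumption of this sketch), e.g. because the source table has since
    // gained a column with that name.
    static List<Field> withExtraColumns(List<Field> original, List<Field> extras) {
        Set<String> seen = new HashSet<>();
        List<Field> result = new ArrayList<>();
        for (Field f : original) {
            seen.add(f.name.toLowerCase(Locale.ROOT));
            result.add(f);
        }
        for (Field f : extras) {
            if (!seen.add(f.name.toLowerCase(Locale.ROOT))) {
                throw new IllegalArgumentException("Column already exists: " + f.name);
            }
            result.add(f);
        }
        return result;
    }

    public static void main(String[] args) {
        // Pretend this list was read from the source table before pipeline construction.
        List<Field> original = List.of(new Field("id", "INTEGER"), new Field("price", "FLOAT"));
        List<Field> extras = List.of(new Field("price_with_tax", "FLOAT"));
        for (Field f : withExtraColumns(original, extras)) {
            System.out.println(f.name + " " + f.type);
        }
    }
}
```

Because this runs while the pipeline is being constructed, the result is fixed for the lifetime of the job, which is why a relaunch is needed after the source table changes.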
Upvotes: 1