Reputation: 443
I'm trying to write a Dataflow job that needs to process logs located on storage and write them to different BigQuery tables. Which output tables get used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log, after which I group the logs on the keys. I need to write all the logs grouped on the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery transform with a callable as the table argument, as described in the documentation here.
I would like to use a date-partitioned table, as this will easily allow me to write_truncate on the different partitions.
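Roughly, the write step currently looks like this (the project, dataset, schema, and the 'table'/'date' fields on each element are simplified placeholders):

```python
import apache_beam as beam

def to_partitioned_table(element):
    # Each element is a dict representing one BigQuery row; 'table' and 'date'
    # are placeholder fields attached earlier in the pipeline from the log value.
    # Returns e.g. "my-project:logs.table_name_x$20190322".
    return 'my-project:logs.{}${}'.format(element['table'], element['date'])

# ... earlier steps read the logs from storage and build the row dicts ...
(log_rows
 | 'WriteLogs' >> beam.io.WriteToBigQuery(
     table=to_partitioned_table,
     schema='timestamp:TIMESTAMP,message:STRING',  # placeholder schema
     create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
     write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```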
Now I encounter 2 main problems:
1. CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step, and creating them if they don't.
2. Writing to a partition fails with: "The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."

This seems like a limitation of streaming inserts; is there any way to do batch inserts instead?
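For completeness, the pre-creation step from point 1 looks roughly like this (project, dataset, and schema are placeholders; I'm using the google-cloud-bigquery client, and exists_ok needs a reasonably recent version of it):

```python
from google.cloud import bigquery

client = bigquery.Client(project='my-project')  # placeholder project

def ensure_partitioned_table(table_id):
    """Create a day-partitioned table if it does not exist yet."""
    table = bigquery.Table(table_id, schema=[
        bigquery.SchemaField('timestamp', 'TIMESTAMP'),
        bigquery.SchemaField('message', 'STRING'),
    ])
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY)
    client.create_table(table, exists_ok=True)  # no-op if the table already exists

ensure_partitioned_table('my-project.logs.table_name_x')
```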
Maybe I'm approaching this wrong and should use another method. Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam 2.13.0.
Upvotes: 0
Views: 1155
Reputation: 918
That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes (the $YYYYMMDD decorator).
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the partition decorator when selecting the destination table (use "[prefix]_YYYYMMDD") and then have each individual table be column-partitioned.
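A rough sketch of what that destination callable could look like (project, dataset, and field names are placeholders): the date goes into the table name itself rather than a $ decorator, and each table is created beforehand with column-based partitioning (e.g. TimePartitioning(field='timestamp')).

```python
import apache_beam as beam

def destination(element):
    # Route on the value from the log; the date suffix is part of the table
    # *name* (prefix_YYYYMMDD), not a "$" partition decorator.
    return 'my-project:logs.{}_{}'.format(element['table'], element['date'])

(log_rows
 | 'WriteLogs' >> beam.io.WriteToBigQuery(
     table=destination,
     schema='timestamp:TIMESTAMP,message:STRING',  # placeholder schema
     write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```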
Upvotes: 3