Georges Lorré

Reputation: 443

Write to a date-partitioned BigQuery table using the beam.io.gcp.bigquery.WriteToBigQuery module in Apache Beam

I'm trying to write a Dataflow job that needs to process logs located on Cloud Storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs, so I do some processing on the logs and yield each one with a key based on a value in the log. I then group the logs by key, and I need to write all the logs grouped under the same key to the same table.

I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument, as described in the documentation here.

I would like to use date-partitioned tables, as this will easily allow me to write_truncate the individual partitions.
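This is roughly what the write step looks like (a simplified sketch without the grouping step; the bucket path, project, dataset, schema, and the parse_log parser are placeholders for my actual code):

```python
import json

import apache_beam as beam


def parse_log(line):
    # Placeholder parser: the real code derives 'key' and 'date'
    # from the log record.
    record = json.loads(line)
    return {'key': record['key'], 'date': record['date'], 'payload': line}


def to_partitioned_table(element):
    # Pick the destination table from the log's key and append the
    # partition decorator for the log's date, e.g. "table_name_x$20190322".
    return 'my-project:my_dataset.table_name_%s$%s' % (
        element['key'], element['date'].replace('-', ''))


with beam.Pipeline() as p:
    (p
     | 'ReadLogs' >> beam.io.ReadFromText('gs://my-bucket/logs/*')
     | 'ParseLogs' >> beam.Map(parse_log)
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
         table=to_partitioned_table,
         schema='key:STRING,date:DATE,payload:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```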

Now I run into the following error:

"The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."

This seems like a limitation of streaming inserts; is there any way to do batch inserts instead?

Maybe I'm approaching this wrong and should use another method. Any guidance on how to tackle these issues is appreciated.

I'm using Python 3.5 and apache-beam==2.13.0.

Upvotes: 0

Views: 1155

Answers (1)

Cubez

Reputation: 918

That error message can show up when one mixes an ingestion-time partitioned table with a column-partitioned table (see this similar issue). Summarizing from the link: you cannot use a partition suffix (the "$YYYYMMDD" decorator) when writing to a column-partitioned table; decorators only work with ingestion-time partitioning.

In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the partition decorator when selecting which table to write to (use "table_name_x" rather than "table_name_x$YYYYMMDD") and have each individual table be column-partitioned.
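As an illustration (a sketch only; the project, dataset, and schema below are made up to match the error message in your question), each destination table can be created as column-partitioned with the BigQuery client library:

```python
from google.cloud import bigquery

client = bigquery.Client(project='my-project')  # made-up project

table = bigquery.Table(
    'my-project.my_dataset.table_name_x',
    schema=[
        bigquery.SchemaField('key', 'STRING'),
        bigquery.SchemaField('date', 'DATE'),
        bigquery.SchemaField('payload', 'STRING'),
    ])
# Partition on the DATE column instead of on ingestion time.
table.time_partitioning = bigquery.TimePartitioning(field='date')
client.create_table(table)
```

The table callable passed to WriteToBigQuery then returns just "my-project:my_dataset.table_name_x" with no "$" suffix; BigQuery routes each row into the correct partition based on the value of the date column.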

Upvotes: 3
