Sam Tsai

Reputation: 373

Google Dataflow template size cap at 10 MB

I set up a template on Google Dataflow and it ran fine. After modifying it to add parallel processing via Partition, the template grew much larger. When I tried to run it, it failed with an error like the following:

Template file 'gs://my-bucket/templates/my-template-name' was too large. Max size is 10485760 bytes.

It looks like GCP caps the template size at around 10 MB. Is there any way to increase the limit or compress the generated template? The change I made essentially creates a Partition from a PCollection; each PCollection in the resulting PCollectionList then goes through the same structure of transforms and file writes. Without the partition, the template is 1.5 MB. With 4 partitions it grew to 6 MB, and with 8 partitions it grew to 12 MB. Doesn't this limit how complex a pipeline can be?

Here is a description of the partitioning. The original process is like this: String option -> PCollection of input files -> TextIO -> sort -> write

After adding the partition, it looks like:

String option -> PCollection of input files -> Partition -> each partition does TextIO -> sort -> write

The Partition in the middle is the only major change. Why would this make the template several times bigger?
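In code, the partitioned version is structured roughly like the sketch below (simplified: the options interface and the sort transform are stand-ins for my actual ones, and the hash-based partition function is just an example):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Partition;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PartitionedPipeline {

  // Simplified options; the real pipeline takes more parameters.
  public interface Options extends PipelineOptions {
    String getInput();
    void setInput(String value);
    String getOutputPrefix();
    void setOutputPrefix(String value);
  }

  // Stand-in for the real sort step; here it just passes elements through.
  static class SortAndFormat extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> expand(PCollection<String> input) {
      return input;
    }
  }

  public static void main(String[] args) {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    Pipeline p = Pipeline.create(options);

    // Read the input files named by the String option.
    PCollection<String> lines = p.apply("Read", TextIO.read().from(options.getInput()));

    // Split into N partitions, routing each element by a simple hash.
    int numPartitions = 8;
    PCollectionList<String> parts = lines.apply("Partition",
        Partition.of(numPartitions,
            (Partition.PartitionFn<String>) (line, n) -> (line.hashCode() & Integer.MAX_VALUE) % n));

    // Each partition repeats the same sort + write structure.
    for (int i = 0; i < parts.size(); i++) {
      parts.get(i)
          .apply("Sort" + i, new SortAndFormat())
          .apply("Write" + i, TextIO.write().to(options.getOutputPrefix() + "-part-" + i));
    }

    p.run();
  }
}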

Upvotes: 2

Views: 1424

Answers (1)

Yueyang Qiu

Reputation: 159

This is a known issue with Dataflow. If you are using Beam SDK >= 2.9, you can add --experiments=upload_graph to the command you use to generate the template; it should produce a smaller template. However, I am not sure whether this feature is fully rolled out to all Dataflow users yet, since it was only recently implemented. If not, it may take a few weeks to become fully available.
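For example, if you stage the template with the Maven exec plugin, the flag just goes into the pipeline arguments (the main class, project, and bucket below are placeholders):

mvn compile exec:java \
  -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=DataflowRunner \
    --project=my-project \
    --stagingLocation=gs://my-bucket/staging \
    --templateLocation=gs://my-bucket/templates/my-template-name \
    --experiments=upload_graph"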

Upvotes: 2
