Pub/Sub to Datastore Batch Dataflow Job in Apache Beam Python SDK Possible?

Question

I have a Pub/Sub topic which will periodically (usually once every few days or weeks, but sometimes more often) receive batches of messages. I'd like to kick off a batch Dataflow job to read these messages, perform some transformations, write the result to Datastore, then stop running. When a new batch of messages goes out, I want to kick off a new job. I've read over the Apache Beam Python SDK docs and many SO questions, but am still uncertain about a few things.

Can Pub/Sub IO read as part of a non-streaming job? Can the same job then write using Datastore IO (which doesn't currently support streaming)? Can I assume the default global window and trigger will correctly tell the job when to stop reading from Pub/Sub (when the batch of messages is no longer being written)? Or do I need to add some sort of trigger/windowing scheme like max time or max number of elements? Would that trigger, when triggered, tell the global window to close and therefore end the job?

Pub/Sub to Datastore Batch Dataflow Job in Apache Beam Python SDK Possible?

Answers (1)

Related Questions