Reputation: 5578
I'm trying to understand Beam/Dataflow concepts better, so pretend I have the following streaming pipeline:
pipeline
    .apply(PubsubIO.readStrings().fromSubscription("some-subscription"))
    .apply(ParDo.of(new DoFn<String, String>() {
      // LOGGER: an SLF4J-style logger assumed to be defined on the enclosing class
      @ProcessElement
      public void processElement(ProcessContext c) {
        String message = c.element();
        LOGGER.debug("Got message: {}", message);
        c.output(message);
      }
    }));
How often will the unbounded source pull messages from the subscription? Is this configurable at all (potentially based on windows/triggers)?
Since no custom windowing/triggers have been defined, and there are no sinks (just a ParDo that logs and re-outputs the message), will my ParDo still be executed immediately as messages are received? And is that setup problematic in any way (not having any windows/triggers/sinks defined)?
Upvotes: 1
Views: 1115
Reputation: 17913
It will pull messages from the subscription continuously - as soon as a message arrives, it will be processed immediately (modulo network and RPC latency).
Windowing and triggers do not affect this at all - they only affect how the data gets grouped at grouping operations (GroupByKey and Combine). If your pipeline doesn't have grouping operations, windowing and triggers are basically a no-op.
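For example, here is a minimal sketch of where windowing does become observable. The one-minute fixed window and the Count.perElement aggregation are arbitrary illustrative choices, not part of your pipeline; the point is that Window.into only has an effect because of the grouping step on the last line, and removing that line would make the windowing a no-op again:

import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

pipeline
    .apply(PubsubIO.readStrings().fromSubscription("some-subscription"))
    // Assigns each element to a one-minute fixed window; by itself this
    // changes nothing observable about how elements flow through ParDos.
    .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))
    // The grouping step is where windowing takes effect: Count.perElement
    // (a Combine under the hood) emits one count per element per window.
    .apply(Count.perElement());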
The Beam model does not have the concept of a sink - writing to various storage systems (e.g. writing files, writing to BigQuery etc.) is implemented as regular Beam composite transforms, made of ParDo and GroupByKey like anything else. E.g. writing each element to its own file could be implemented by a ParDo whose @ProcessElement opens the file, writes the element to it, and closes the file (see the sketch below).
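A rough sketch of that "one file per element" idea, assuming a hypothetical PCollection<String> named messages and an existing local /tmp/out directory (both my own illustrative choices); Files.write opens the file, writes the bytes, and closes it in a single call:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.UUID;

messages.apply(ParDo.of(new DoFn<String, Void>() {
  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    // One file per element; UUID-based names avoid collisions
    // between parallel workers (an illustrative naming scheme).
    Files.write(
        Paths.get("/tmp/out", UUID.randomUUID() + ".txt"),
        c.element().getBytes(StandardCharsets.UTF_8));
  }
}));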
Upvotes: 3