Trade-offs of a single pipeline with multiple input sources versus multiple pipelines, each with its own input source?

Question

I am working on a project that receives requests from multiple clients through Pub/Sub. Dataflow pipelines process these requests in streaming mode and produce the responses. The flows share some common logic, and each one also reads from and writes to Bigtable and BigQuery.

What are the pros and cons (on both the development and the maintenance side) of using a single pipeline that receives input from all the clients, compared with a separate pipeline for each input?

jkff · Accepted Answer

In terms of development, the two approaches are about equally complex: either way you probably have the common code written in one place, or perhaps the entire pipeline code is identical and you're just launching it with different parameters for different clients.
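To illustrate the "same code, different parameters" case, here is a minimal sketch of a helper that builds the launch flags for one client's job. `--runner`, `--streaming`, `--project`, `--region` and `--job_name` are standard Beam/Dataflow pipeline options; `--input_topic` and `--output_table`, the `requests-<client>` topic naming, the `responses` dataset and the `us-central1` default are all assumptions made up for this example, not something from the question.

```python
def launch_args(client_id, project, region="us-central1"):
    """Build the flag list for one client's streaming job.

    The topic/table naming scheme and the custom flags
    (--input_topic, --output_table) are hypothetical.
    """
    return [
        "--runner=DataflowRunner",
        "--streaming",
        f"--project={project}",
        f"--region={region}",
        # One job per client, so each job gets a distinct name.
        f"--job_name=requests-{client_id}",
        # Hypothetical custom options read by the shared pipeline code.
        f"--input_topic=projects/{project}/topics/requests-{client_id}",
        f"--output_table={project}:responses.{client_id}",
    ]


# The same pipeline code would be launched once per client.
for client in ["acme", "globex"]:
    print(" ".join(launch_args(client, "my-project")))
```

With a single shared pipeline you would instead launch once and pass the full list of clients (or topics) as one parameter.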

Maintenance-wise, there are pros and cons to both approaches.

One pipeline is likely to be cheaper. E.g. if overall traffic is very low and processing for all the clients fits on 1 machine, then a single pipeline will actually run on 1 machine. With separate pipelines, each one needs at least 1 machine, so with N clients you'll be using at least N machines all the time.
One pipeline might be easier to observe and monitor in the UI, and easier to deploy. That, though, depends on the structure of the pipeline: are you going to pipe all clients' data through the same transforms, or have, say, 1 read transform per client (for example, if each client's data is read from a different Pub/Sub topic and written to a different BigQuery table)? If it's all the same transforms, you get the benefit of launching the pipeline once and not having to do anything at all when a client is added or removed. With per-client transforms, you'll need to update the pipeline each time.
With several pipelines (one per client), it's easier to isolate issues that affect individual clients. E.g. you could stop processing individual clients one by one, or update them one by one (say, if you're testing out some experimental code and don't want to break all the clients at the same time if it's wrong). A bug in the pipeline is also much less likely to mix one client's data with another client's data.
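The cost point above can be made concrete with a rough back-of-the-envelope model. This is only a sketch under simplifying assumptions: every worker handles the same fixed throughput (`capacity`), load is measured in messages per second, and a running pipeline always holds at least one worker. The function names and the numbers are invented for illustration.

```python
import math


def workers_single_pipeline(loads, capacity):
    """Minimum workers when all clients share one pipeline."""
    return max(1, math.ceil(sum(loads) / capacity))


def workers_separate_pipelines(loads, capacity):
    """Minimum workers when each client has its own pipeline.

    Every pipeline needs at least one worker, even when idle.
    """
    return sum(max(1, math.ceil(load / capacity)) for load in loads)


# Three low-traffic clients (messages/sec); one worker handles 100/sec.
loads = [5, 10, 20]
print(workers_single_pipeline(loads, 100))     # 1
print(workers_separate_pipelines(loads, 100))  # 3
```

The gap is largest when per-client traffic is far below one worker's capacity; once each client needs several workers on its own, the two totals converge.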
