Is this a good use-case for Dataflow?

Question

We currently are using google taskqueues to batch up requests to store analytics data into Keen and Stathat (more performant with batch puts). In order to consume from the taskqueues, we have a set of process brokers and workers to consume from the taskqueues. Seeing as dataflow is something where we just write the logic for pushing to our analytics solutions and we can specify a batch size to pull when processing in our dataflow program, I was curious if the overhead (seems more taylored to much larger applications) of dataflow is a good fit.

Jeremy Lewi · Accepted Answer

Your use case seems like a good one for Dataflow. Rather than publishing to a task queue you could publish to pubsub as a way to stream your data to your Dataflow job. Your Dataflow job could use Dataflow windows and triggers to batch your data based on size and/or time. You could then write each batch to your datastore.

Dataflow should work well on small datasets. The overhead would likely be in the cost of unused CPU cycles of Dataflow workers. Dataflow allows you to control the number of workers so you can allocate a number of workers suitable for your data size.

Utilization will depend on how evenly your load is spread out in time. If your peak and average loads are quite different then you can make a tradeoff between latency and utilization. If you want to maintain low latency then you can pick the number of workers so that you keep up during peak times. On the other hand if you want to maximize utilization, you can provision the number of workers based on average load. During peak times you would start to accumulate a backlog of messages in pubsub. The system would get rid of that backlog during non-peak times when there was spare capacity.

Right now Dataflow doesn't support writing custom sinks for unbounded data. One way to work around this is to do the writes from a DoFn rather than a sink. This should work just fine provided you can do your writes in an idempotent way so that writing a record multiple times won't cause problems.

Windowing and triggers are a way of dividing your data into finite batches to which aggregations (e.g. grouping, summing, etc...) can be applied. This blog post explains it better than I could (look at the section "windowing").

Is this a good use-case for Dataflow?

Answers (1)

Related Questions