Paweł Szczur

Reputation: 5542

Real-time pipeline feedback loop

I have a dataset with potentially corrupted/malicious data. The data is timestamped. I'm rating the data with a heuristic function. After a period of time, I know that all new data items arriving with certain IDs need to be discarded, and they represent a significant portion of the data (up to 40%).

Right now I have two batch pipelines:

  1. First one just runs the rating over the data.
  2. The second one first filters out the corrupted data and runs the analysis.

I would like to switch from batch mode (say, running every day) to an online processing mode (hoping for a delay of < 10 minutes).

The second pipeline uses a global window, which makes processing easy. When a corrupted key is detected, all records with that key are simply discarded (using the keys discarded on previous days as a pre-filter is also easy). Additionally, it makes decisions about the output data easier, since all historic data for a given key is available during processing.

The main question is: can I create a loop in a Dataflow DAG? Let's say I would like to accumulate the quality rates given to each session window I process, and if the rate sum exceeds X, a filter function in an earlier stage of the pipeline should filter out the malicious keys.
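To make the intended feedback concrete, here is a plain-Python sketch of the threshold logic described above, independent of the Beam API (the class name, method names, and threshold value are all hypothetical):

```python
from collections import defaultdict

class MaliciousKeyFilter:
    """Accumulates per-key quality rates and flags a key as malicious
    once the sum of its rates exceeds a threshold (the "X" above)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.rate_sums = defaultdict(float)
        self.malicious = set()

    def observe(self, key, rate):
        """Record the quality rate of one processed session window."""
        self.rate_sums[key] += rate
        if self.rate_sums[key] > self.threshold:
            self.malicious.add(key)

    def accept(self, key):
        """Return True if records for this key should still pass the filter."""
        return key not in self.malicious
```

In a DAG this state would have to flow from the rating stage back to the earlier filter stage, which is exactly the cycle the question asks about.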

I know about side inputs, but I don't know whether they can change during runtime.

I'm aware that a DAG by definition cannot have a cycle, but how can I achieve the same result without one?

The idea that comes to mind is to use a side output to mark an ID as malicious and create a fake unbounded output/input: the output would dump the data to some storage, and the input would load it every hour and stream it so it can be joined.

Upvotes: 2

Views: 597

Answers (1)

jkff

Reputation: 17913

Side inputs in the Beam programming model are windowed.

So you were on the right track: it seems reasonable to structure the pipeline in two parts: 1) computing a detection model for the malicious data, and 2) taking the model as a side input and the data as the main input, and filtering the data according to the model. This second part of the pipeline will receive the model for the matching window, which seems to be exactly what you want.
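The two-part structure can be sketched in plain Python (a simulation of the per-window side-input semantics, not actual Beam API; function names and the threshold are hypothetical):

```python
def build_model(window_records, threshold=10.0):
    """Part 1: compute the detection 'model' for one window --
    here, the set of keys whose summed quality rates exceed a threshold."""
    sums = {}
    for key, rate in window_records:
        sums[key] = sums.get(key, 0.0) + rate
    return {key for key, total in sums.items() if total > threshold}

def filter_window(window_records, malicious_keys):
    """Part 2: treat the model as a windowed 'side input' and
    drop records whose key is in it."""
    return [(key, rate) for key, rate in window_records if key not in malicious_keys]

# For each window, the filter sees the model computed from the matching window.
window = [("a", 6.0), ("a", 7.0), ("b", 2.0)]
model = build_model(window)            # key "a" sums to 13.0 > 10.0
clean = filter_window(window, model)   # only "b" records remain
```

In an actual Beam pipeline, part 1 would produce a windowed `PCollection` consumed as a side input (e.g. as a map or set view) by the filtering `ParDo` in part 2.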

In fact, this is one of the main examples in the MillWheel paper (page 2), upon which Dataflow's streaming runner is based.

Upvotes: 1
