Execute flink sink after tumbling window

Question

Source: Kinesis data stream

Sink: Elasticesearch

For both using AWS services.

Also, running my Flink job on AWS Kinesis data analytics application

I am facing an issue with the windowing function of flink. My job looks like this

DataStream input = ...; // input from kinesis stream
input.keyBy(e -> e.getArea())
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce(new MyReduceFunction(), new MyProcessWindowFunction())
                .addSink();

 private static class MyReduceFunction implements ReduceFunction {
        @Override
        public TrackingData reduce(TrackingData trackingData, TrackingData t1) throws Exception {
            trackingData.setVideoDuration(trackingData.getVideoDuration() + t1.getVideoDuration());
            return trackingData;
        }
    }

private static class MyProcessWindowFunction extends ProcessWindowFunction {
        public void process(String key,
                            Context context,
                            Iterable in,
                            Collector out) {

            TrackingData trackingIn = in.iterator().next();

            Long videoDuration =0l;
            for (TrackingData t: in) {
                videoDuration += t.getVideoDuration();
            }
            trackingIn.setVideoDuration(videoDuration);
            out.collect(trackingIn);
        }
    }

sample event :

{"area":"sessions","userId":4450,"date":"2021-12-03T11:00:00","videoDuration":5}

What I do here is from the kinesis stream I got these events in a large amount I want to sum videoDuration for every 10 seconds of window then I want to store this single event into elasticsearch.

In Kinesis there can be 10,000 events per second. I don't want to store all 10,000 events in elasticsearch i just want to store only one event for every 10 seconds.

The issue is when I send an event to this job it quickly processes this event and directly sinks into elasticsearch but I want to achieve : till every 10 seconds I want events videoDuration time to be incremented and after 10 seconds only one event to be store in elasticearch.

How can I achieve this?

David Anderson · Accepted Answer

I think you've misdiagnosed the problem.

The code you've written will produce one event from each 10-second-long window for each distinct key that has events during the window. MyProcessWindowFunction isn't having any effect: since the window results have been pre-aggregated, each Iterable will contain exactly one event.

I believe you want to do this instead:

input.keyBy(e -> e.getArea())
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce(new MyReduceFunction())
                .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce(new MyReduceFunction())
                .addSink();

You could also just do

input.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce(new MyReduceFunction())
                .addSink();

but the first version will be faster, since it will be able to compute the per-key window results in parallel before computing the global sum in the windowAll.

FWIW, the Table/SQL API is usually a better fit for this type of application, and should produce a more optimized pipeline than either of these.

Execute flink sink after tumbling window

Answers (1)

Related Questions