Reputation: 1781
In the e2e Flink SQL tutorial the source table is defined as a Kafka-sourced table with timestamp column upon which watermarking is enabled
CREATE TABLE user_behavior (
user_id BIGINT,
item_id BIGINT,
category_id BIGINT,
behavior STRING,
ts TIMESTAMP(3),
proctime AS PROCTIME(), -- generates processing-time attribute using computed column
WATERMARK FOR ts AS ts - INTERVAL '5' SECOND -- defines watermark on ts column, marks ts as event-time attribute
) WITH (
'connector' = 'kafka', -- using kafka connector
'topic' = 'user_behavior', -- kafka topic
'scan.startup.mode' = 'earliest-offset', -- reading from the beginning
'properties.bootstrap.servers' = 'kafka:9094', -- kafka broker address
'format' = 'json' -- the data format is json
);
As long as GROUP BY is made by a TUMBLE upon ts
field, it seems natural (since Flink knows when to trigger / eject the windows) but in the middle of the tutorial we see the following expression
INSERT INTO cumulative_uv
SELECT date_str, MAX(time_str), COUNT(DISTINCT user_id) as uv
FROM (
SELECT
DATE_FORMAT(ts, 'yyyy-MM-dd') as date_str,
SUBSTR(DATE_FORMAT(ts, 'HH:mm'),1,4) || '0' as time_str,
user_id
FROM user_behavior)
GROUP BY date_str;
Here we see that GROUP BY is made on derivative date_str
field, but how does watermarking works here? How does Flink decides when to "close" date_str bucket? Since date_str
is some function over ts
, it must somehow understand how the watermark update for ts
would translate into waterlevel for date_str
field which seems unfeasable to me. How does it work internally, does Flink stores all encountered records in it's state?
Upvotes: 0
Views: 720
Reputation: 823
Perhaps you can refer to the link below to learn about the generation and delivery of Watermarks, especially "How Operators Process Watermarks"
In this example, the watermark is generated from the ts of the source operator, and the downstream operator will only process the watermark, which has nothing to do with the date_str field.
public class TimestampsAndWatermarksOperator<T> extends AbstractStreamOperator<T>
implements OneInputStreamOperator<T, T>, ProcessingTimeCallback {
......
@Override
public void open() throws Exception {
super.open();
timestampAssigner = watermarkStrategy.createTimestampAssigner(this::getMetricGroup);
watermarkGenerator =
emitProgressiveWatermarks
? watermarkStrategy.createWatermarkGenerator(this::getMetricGroup)
: new NoWatermarksGenerator<>();
wmOutput = new WatermarkEmitter(output);
watermarkInterval = getExecutionConfig().getAutoWatermarkInterval();
if (watermarkInterval > 0 && emitProgressiveWatermarks) {
final long now = getProcessingTimeService().getCurrentProcessingTime();
getProcessingTimeService().registerTimer(now + watermarkInterval, this);
}
}
@Override
public void processElement(final StreamRecord<T> element) throws Exception {
final T event = element.getValue();
final long previousTimestamp =
element.hasTimestamp() ? element.getTimestamp() : Long.MIN_VALUE;
final long newTimestamp = timestampAssigner.extractTimestamp(event, previousTimestamp);
element.setTimestamp(newTimestamp);
output.collect(element);
watermarkGenerator.onEvent(event, newTimestamp, wmOutput);
}
......
@Override
public void processWatermark(org.apache.flink.streaming.api.watermark.Watermark mark)
throws Exception {
// if we receive a Long.MAX_VALUE watermark we forward it since it is used
// to signal the end of input and to not block watermark progress downstream
if (mark.getTimestamp() == Long.MAX_VALUE) {
wmOutput.emitWatermark(Watermark.MAX_WATERMARK);
}
}
......
}
Upvotes: 0