Streaming Flink SQL with GROUP BY over not timestamp column

Question

In the e2e Flink SQL tutorial the source table is defined as a Kafka-sourced table with timestamp column upon which watermarking is enabled

CREATE TABLE user_behavior (
    user_id BIGINT,
    item_id BIGINT,
    category_id BIGINT,
    behavior STRING,
    ts TIMESTAMP(3),
    proctime AS PROCTIME(),   -- generates processing-time attribute using computed column
    WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  -- defines watermark on ts column, marks ts as event-time attribute
) WITH (
    'connector' = 'kafka',  -- using kafka connector
    'topic' = 'user_behavior',  -- kafka topic
    'scan.startup.mode' = 'earliest-offset',  -- reading from the beginning
    'properties.bootstrap.servers' = 'kafka:9094',  -- kafka broker address
    'format' = 'json'  -- the data format is json
);

As long as GROUP BY is made by a TUMBLE upon ts field, it seems natural (since Flink knows when to trigger / eject the windows) but in the middle of the tutorial we see the following expression

INSERT INTO cumulative_uv
SELECT date_str, MAX(time_str), COUNT(DISTINCT user_id) as uv
FROM (
  SELECT
    DATE_FORMAT(ts, 'yyyy-MM-dd') as date_str,
    SUBSTR(DATE_FORMAT(ts, 'HH:mm'),1,4) || '0' as time_str,
    user_id
  FROM user_behavior)
GROUP BY date_str;

Here we see that GROUP BY is made on derivative date_str field, but how does watermarking works here? How does Flink decides when to "close" date_str bucket? Since date_str is some function over ts, it must somehow understand how the watermark update for ts would translate into waterlevel for date_str field which seems unfeasable to me. How does it work internally, does Flink stores all encountered records in it's state?

Streaming Flink SQL with GROUP BY over not timestamp column

Answers (1)

Related Questions