user3822232

Reputation: 315

How to do content-based deduplication using Flink SQL

I have my Flink SQL statements as follows:

CREATE OR REPLACE TABLE table_one /** mode('streaming')*/
(
    `pk` STRING,
    `id` STRING,
    `segments` ARRAY<STRING>,
    `headers` MAP<STRING, BYTES> METADATA,
    `kafka_key` STRING,
    `ts` TIMESTAMP(3) METADATA FROM 'timestamp' VIRTUAL,
    WATERMARK FOR `ts` AS `ts` - INTERVAL '1' SECOND,
    `hard_deleted` BOOLEAN
) WITH (
    'connector' = 'kafka',
    'properties.bootstrap.servers' = 'kafka:29092',
    'properties.group.id' = 'grp1',
    'topic-pattern' = 'topic_one',
    'value.format' = 'json',
    'key.format' = 'raw',
    'key.fields' = 'kafka_key',
    'value.fields-include' = 'EXCEPT_KEY',
    'scan.startup.mode' = 'earliest-offset',
    'json.timestamp-format.standard' = 'ISO-8601',
    'json.fail-on-missing-field' = 'false',
    'json.ignore-parse-errors' = 'true'
);

CREATE OR REPLACE VIEW table_one_source AS (
SELECT cast(`headers`['pod'] as varchar)            as pod,
       cast(`headers`['org'] as varchar)            as org,
       cast(`headers`['tenantId'] as varchar)       as tenantId,
       kafka_key                                    as pk,
       COALESCE(id, SPLIT_INDEX(kafka_key, '#', 1)) as id,
       segments,
       ts
FROM table_one
WHERE `headers`['tenantId'] is not null
  AND `headers`['pod'] is not null
  AND `headers`['org'] is not null
);

CREATE OR REPLACE VIEW table_one_source_keyed AS (
    WITH table_one_source_hash AS (
        -- hash the content we care about, per (tenantId, id) key
        SELECT
            pod, org, tenantId, pk, id, segments, ts,
            HASH_CODE(tenantId || id || CAST(segments AS STRING)) AS data_hash
        FROM table_one_source
    ),
    entitlement_source_deduped AS (
        SELECT *
        FROM (
            SELECT *,
                   LAG(data_hash) OVER (PARTITION BY tenantId, id ORDER BY ts) AS prev_data_hash
            FROM table_one_source_hash
        )
        -- keep a row only if its content hash differs from the previous row for the same key
        WHERE data_hash IS DISTINCT FROM prev_data_hash OR prev_data_hash IS NULL
    )

    SELECT * FROM entitlement_source_deduped
);

The goal is that only new rows, or rows whose id or segments differ from the previous row for the same key, flow downstream from table_one. The SQL above works, and it produces a DAG like this: [screenshot of the Flink job graph]

It uses: OverAggregate(partitionBy=[$5, $6], orderBy=[ts ASC], window=[RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW], select=[segments, kafka_key, ts, $3, $4, $5, $6, $7, LAG($7) AS w0$o0]). The window for this OverAggregate appears to be unbounded, and I am worried that this operator's state can grow very large.

Question: Is there another way to deduplicate based on the content of the message?

Upvotes: 0

Views: 80

Answers (1)

Roman Boyko

Reputation: 21

If you don't need to output prev_data_hash, you can try Flink's deduplication SQL (https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/sql/queries/deduplication/). It will create a RankOperator with a Top1Function, whose state is still unbounded (like the OverAggregate in your example), but it will probably be a bit more efficient (especially if a primary key is defined on the source table).
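For reference, here is a minimal sketch of that deduplication pattern applied to the table_one_source_hash CTE from the question (the row_num alias and the choice to partition by the content hash are assumptions, not part of the original query):

SELECT pod, org, tenantId, pk, id, segments, ts, data_hash
FROM (
    SELECT *,
           -- Flink recognizes ROW_NUMBER() ... WHERE row_num = 1 as deduplication
           ROW_NUMBER() OVER (
               PARTITION BY tenantId, id, data_hash
               ORDER BY ts ASC
           ) AS row_num
    FROM table_one_source_hash
)
WHERE row_num = 1;

Note one semantic difference from the LAG version: because the first row per (tenantId, id, data_hash) has already been emitted, a key whose content later flips back to an earlier hash stays suppressed instead of being re-emitted.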

Or you can configure state TTL to limit the size of the unbounded state: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/concepts/overview/#different-ways-to-configure-state-ttl
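For example (a sketch; the one-hour value is purely illustrative), in the SQL client you could set the pipeline-wide idle-state retention before submitting the query:

-- State for keys idle longer than the TTL is dropped, so a duplicate
-- arriving after expiry would be emitted again.
SET 'table.exec.state.ttl' = '1 h';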

Upvotes: 0
