Gibbs
Gibbs

Reputation: 22956

Is it right use-case of KSql

I am using KStreams where I need to de-duplicate the data. Source ingests duplicated data due to many reasons i.e data itself duplicate, re-partitioning.

Currently using Redis for this use-case where data is stored something as below

id#object list-of-applications-processed-this-id-and-this-object

As KSQL is implemented on top of RocksDB which is also a Key-Value database, can I use KSql for this use case?

At the time of successful processing, I would add an entry to KSQL. At the time of reception, I will have to check the existence of the id in KSQL.

Is it correct use case as per KSql design in the event processing world?

Upvotes: 0

Views: 411

Answers (1)

Matthias J. Sax
Matthias J. Sax

Reputation: 62285

If you want to use to use ksqlDB as a cache, you can create a TABLE using the topic as data source. Note that a CREATE TABLE statement by itself, does only declare a schema (it does not pull in any data into ksqlDB yet).

CREATE TABLE inputTable <schemaDefinition> WITH(kafka_topic='...');

Check out the docs for more details: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/

To pull in the data, you can create a second table via:

CREATE TABLE cache AS SELECT * FROM inputTable;

This will run a query in the background, that read the input data and puts the result into the ksqlDB server. Because the query is a simple SELECT * it effectively pulls in all data from the topic. You can now issue "pull queries" (i.e, lookups) against the result to use TABLE cache as desired: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/

Future work:

We are currently working on adding "source tables" (cf https://github.com/confluentinc/ksql/pull/7474) that will make this setup simpler. If you declare a source table, you can do the same with a single statement instead of two:

CREATE SOURCE TABLE cache <schemaDefinition> WITH(kafka_topic='...');

Upvotes: 2

Related Questions