Shadow

Reputation: 123

What is the best way to do metadata join in Flink datastream?

We have a Kafka stream of events that we want to enrich using some metadata that resides inside a MySQL DB.

The metadata changes every few hours. Essentially we want to periodically read the DB and keep enriching the events with this new metadata.

One way could be to use broadcast state with a periodic source that reads the DB every few minutes/hours, broadcast that stream, and use it for the join. The problem is that the first read of the broadcast stream may arrive later than some of the messages already being read from the Kafka stream.

Is there any better way?

Upvotes: 0

Views: 696

Answers (1)

David Anderson

Reputation: 43697

You could use Flink SQL for this. Depending on the exact requirements, you might either do time-versioned joins against a CDC stream from the MySQL DB, or do lookup joins against MySQL (perhaps with the optional cache enabled).
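For the lookup-join variant, a sketch in Flink SQL might look like the following. The table and column names are made up, and the cache options shown are the JDBC connector's optional lookup cache settings; check the connector documentation for the exact option names in your Flink version:

```sql
-- Metadata table backed by MySQL via the JDBC connector (names illustrative)
CREATE TABLE metadata (
  id BIGINT,
  category STRING,
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://localhost:3306/mydb',
  'table-name' = 'metadata',
  'lookup.cache.max-rows' = '5000',  -- optional lookup cache
  'lookup.cache.ttl' = '1h'          -- re-read rows roughly hourly
);

-- Lookup join: each event probes MySQL (or the cache) at processing time
SELECT e.*, m.category
FROM events AS e
JOIN metadata FOR SYSTEM_TIME AS OF e.proc_time AS m
  ON e.metadata_id = m.id;
```

With the cache enabled, a cached row is reused until its TTL expires, which matches the "metadata changes every few hours" requirement without hitting MySQL on every event.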

See also https://github.com/ververica/flink-cdc-connectors.

Update:

If you want to use the DataStream API, but are concerned that some of the Kafka messages might be processed before the corresponding data from the broadcast stream is available, you can:

  • in the open() method of your enrichment function, do an initial query against MySQL to preload the metadata
  • if the broadcast data is still unavailable when it's time to do the join, fall back to the data fetched during open(), or to some default value hardwired into the code
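The fallback logic from the two bullets above can be sketched like this. This is a plain-Python simulation of the pattern, not the actual Flink KeyedBroadcastProcessFunction API; the class and method names mirror the Flink ones but are hypothetical here, and the "UNKNOWN" default is an example of a hardwired fallback:

```python
class EnrichmentFunction:
    """Mimics the enrichment logic of a Flink broadcast-join function."""

    def __init__(self, db_query):
        # db_query is a callable standing in for a MySQL query (assumption)
        self._db_query = db_query
        self.preloaded = {}        # snapshot fetched in open()
        self.broadcast_state = {}  # filled as broadcast elements arrive

    def open(self):
        # Initial query against MySQL to preload the metadata
        self.preloaded = dict(self._db_query())

    def process_broadcast_element(self, key, value):
        # Periodic broadcast updates take precedence once they arrive
        self.broadcast_state[key] = value

    def process_element(self, event):
        key = event["key"]
        # Prefer broadcast state; fall back to the preloaded snapshot,
        # then to a hardwired default.
        meta = self.broadcast_state.get(
            key, self.preloaded.get(key, "UNKNOWN"))
        return {**event, "meta": meta}
```

Events arriving before the first broadcast read are enriched from the preloaded snapshot; once broadcast updates arrive, they win.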

Alternatively, you could use the state processor API to bootstrap values for the broadcast state.

Upvotes: 0
