Reputation: 659
I am supposed to join some huge SQL tables with the json of some REST services by some common key ( we are talking about multiple sql tables with a few REST services calls ). The thing is this data is not real time/ infinite stream and also don’t think I could order the output of the REST services by the join columns. Now the silly way would be to bring all data and then match the rows, but that would imply to store everything in memory/ some storage like Cassandra or Redis.
But, I was wondering if flink could use some king of stream window to join say X elements ( so really just store in RAM just those elements at a point ) but also storing the nonmatched element for later match in maybe some kind of hash map. This is what I mean by smart join.
Upvotes: 0
Views: 648
Reputation: 43707
The devil is in the details, but yes, in principle this kind of data enrichment is quite doable with Flink. Your requirements aren't entirely clear, but I can provide some pointers.
For starters you will want to acquaint youself with Flink's managed state interfaces. Using these interfaces will ensure your application is fault tolerant, upgradeable, rescalable, etc.
If you wanted to simply preload some data, then you might use a RichFlatmap
and load the data in the open() method. In your case a CoProcessFunction
might be more appropriate. This is a streaming operator with two inputs that can hold state and also has access to timers (which can be used to expire state that is no longer needed, and to emit results after waiting for out-of-order data to arrive).
Flink also has support for asynchronous i/o, which can make working with external services more efficient.
One could also consider approaching this with Flink's higher level SQL and Table APIs, by wrapping the REST service calls as user-defined functions.
Upvotes: 1