Reputation: 73
I currently have a Flink job that uses the broadcast state pattern, connecting a broadcast stream to an event stream to provide context for decision making. The broadcast data consists of relatively small objects arriving on a medium-throughput stream.
Since broadcast state can only be modified on the broadcast side, the current method for cleaning this data to keep it from growing unboundedly (since it is stored in memory) is to iterate through the entire state in the processBroadcastElement method, removing entries that meet certain criteria based on specific object properties.
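To make the cleanup pattern concrete, here is a minimal plain-Java sketch of the same full-scan eviction (the `ContextValue` record, its `expiresAtMillis` field, and the eviction criterion are hypothetical stand-ins, not the actual job's types; real Flink code would iterate the broadcast `MapState` instead of a `HashMap`):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class BroadcastCleanupSketch {
    // Hypothetical broadcast value: a payload plus an expiry timestamp.
    record ContextValue(String payload, long expiresAtMillis) {}

    // Full scan over the state, removing expired entries -- the same
    // O(state size) sweep done inside processBroadcastElement.
    static void evictExpired(Map<String, ContextValue> state, long nowMillis) {
        Iterator<Map.Entry<String, ContextValue>> it = state.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue().expiresAtMillis() <= nowMillis) {
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        Map<String, ContextValue> state = new HashMap<>();
        state.put("a", new ContextValue("keep", 2_000L));
        state.put("b", new ContextValue("drop", 500L));
        evictExpired(state, 1_000L);
        System.out.println(state.keySet()); // prints [a]
    }
}
```

The cost of this approach is that the scan runs once per broadcast element, so during a backload of hundreds of thousands of elements the total work is roughly quadratic in the backlog size.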
Under normal operation this works fine and does not seem to strain processing power. However, there have been a couple of situations where the state needs to be cleared and backloaded with hundreds of thousands of broadcast stream objects (currently adding up to about 15 MB * 2 parallel instances in checkpoint size). In these situations, the Co-Process-Broadcast operator immediately becomes 100% busy, causing 100% backpressure on both data sources.
A few potential solutions that might be better:

1. MapState for the currently broadcast data plus a keyed event stream, so I can access the state in a rich map function and clean the state there if needed.
2. MapState for the currently broadcast data plus a keyed event stream, so I can access the state in a rich map function and clean the full state on a timer interval in a keyed process function on the event stream.

I'm looking for feedback on which option is the most "correct" pattern for this situation, and which would be the most performant.
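For the timer-interval idea, a rough plain-Java sketch of the intended behavior (names like `sweepIntervalMillis`, `processElement`, `onTimer`, and the `ContextValue` record are illustrative assumptions, not Flink API; in a real KeyedProcessFunction the sweep would be triggered by a registered processing-time timer rather than checked inline):

```java
import java.util.HashMap;
import java.util.Map;

public class TimerSweepSketch {
    // Hypothetical state value: a payload plus an expiry timestamp.
    record ContextValue(String payload, long expiresAtMillis) {}

    private final Map<String, ContextValue> mapState = new HashMap<>();
    private final long sweepIntervalMillis;
    private long nextSweepAt;

    TimerSweepSketch(long sweepIntervalMillis) {
        this.sweepIntervalMillis = sweepIntervalMillis;
        this.nextSweepAt = sweepIntervalMillis;
    }

    // Called per element: the state update itself is O(1); no full scan here.
    void processElement(String key, ContextValue value, long nowMillis) {
        mapState.put(key, value);
        if (nowMillis >= nextSweepAt) { // stand-in for a timer firing
            onTimer(nowMillis);
            nextSweepAt = nowMillis + sweepIntervalMillis;
        }
    }

    // Full-state sweep runs at most once per interval, so the scan cost is
    // amortized across many elements instead of paid on every one.
    void onTimer(long nowMillis) {
        mapState.entrySet().removeIf(e -> e.getValue().expiresAtMillis() <= nowMillis);
    }

    int size() { return mapState.size(); }
}
```

The point of the sketch is the cost model: option 2 decouples cleanup frequency from element throughput, which is exactly what helps during a backload.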
Upvotes: 0
Views: 39