Reputation: 13
Could I set DataStream time window to a large value like 24 hours? The reason for the requirement is that I want to make data statistics based on the latest 24 hours client traffic to the web site. This way, I can check if there are security violations.
For example, check if a user account used multiple source IPs to log on to the web site. Or check how many unique pages a certain IP accessed in the latest 24 hours. If security violation is detected, the configured action will be taken in real time such as blocking the source IP or locking the relevant user account.
The throughput of the web site is around 200Mb/s. I think setting the time window to a large value will cause memory issue. Should I store the statistics results of each time window like 5 minutes into database?
Then make statistics based on database query for the date generated in the latest 24 hours?
I don't have any experience with big data analysis. Any advice will be appreciated.
Upvotes: 1
Views: 1058
Reputation: 957
It depends on what type of window and aggregations we're talking about:
Window where no eviction is used: in this case Flink will only save one accumulated result per physical window. This means that for a sliding window of 10h with 1h slide that computes a sum it would have to have a number 10 times. For a tumbling window (regardless of the parameters) it only saves the result of the aggregation once. However this is not the whole story: because state is keyed you have to multiply all of this for every distinct value of the field used in the group by.
Window with eviction: saves all events that were processed but still weren't evicted.
In short, generally the memory consumption is not tied to how many events you processed or the window's durations but to:
All things considered, I'd say a simple 24-hour window has an almost nonexistent memory footprint.
You can check the relevant code here.
Upvotes: 1