Aliakbar Abbasi

Reputation: 249

A data structure to query the number of events in different time intervals

My program receives thousands of events per second, of different types. For example, 100k API accesses per second from users with millions of different IP addresses. I want to keep statistics and limit the number of accesses per user over 1 minute, 1 hour, 1 day and so on. So I need event counts over the last minute, hour or day for every user, and I want it to behave like a sliding window. In this case, the event type is the user's address.

I started with a time-series database, InfluxDB, but it failed to insert 100k events per second, and aggregate queries to find event counts over a minute or an hour performed even worse. I am fairly sure InfluxDB is not capable of inserting 100k events per second while performing 300k aggregate queries at the same time.

I don't need to retrieve the events themselves from the database; each one is just an address. I only want to count them, as fast as possible, over different time intervals: the number of events of type x in a specific interval (for example, the past hour).

I don't need to store the statistics on disk, so perhaps an in-memory data structure that keeps event counts over different time intervals would be enough for me. On the other hand, it needs to behave like a sliding window.

Storing all the events in RAM in a linked list and iterating over it to answer queries is another solution that comes to mind, but because the number of events is so high, keeping every event in RAM is not a good idea.
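(For context, the usual alternative to storing every event is a fixed ring of per-second counter buckets per key, so memory is proportional to the window length rather than the event rate. A minimal sketch in Python; the class and method names are illustrative, not from any particular library:)

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Approximate sliding-window counts using one counter bucket per second.

    Memory per key is O(window_seconds), independent of the event rate.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        # per-key ring of counters, one slot per second in the window
        self.buckets = defaultdict(lambda: [0] * window_seconds)
        # the absolute second each slot was last written, to detect stale slots
        self.stamps = defaultdict(lambda: [0] * window_seconds)

    def record(self, key, now=None):
        now = int(now if now is not None else time.time())
        i = now % self.window
        if self.stamps[key][i] != now:
            # slot belongs to an expired second: reset it before reuse
            self.stamps[key][i] = now
            self.buckets[key][i] = 0
        self.buckets[key][i] += 1

    def count(self, key, now=None):
        now = int(now if now is not None else time.time())
        # sum only slots whose timestamp still falls inside the window
        return sum(c for c, s in zip(self.buckets[key], self.stamps[key])
                   if now - s < self.window)

counter = SlidingWindowCounter(60)        # 1-minute window
counter.record("1.2.3.4", now=1000)
counter.record("1.2.3.4", now=1030)
counter.record("1.2.3.4", now=900)        # outside the window at t=1030
print(counter.count("1.2.3.4", now=1030))  # → 2
```

For the 1-hour and 1-day windows you would keep coarser buckets (per minute, per hour) the same way, trading a little accuracy at the window edge for constant memory.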

Is there a good data structure, or even a database, for this purpose?

Upvotes: 1

Views: 607

Answers (2)

Yuri Lachin

Reputation: 1500

You didn't provide enough details on the events' input format or on how they are delivered to the statistics backend: is it a stream of UDP messages, HTTP PUT/POST requests, or something else?

One possible solution would be to use the Yandex ClickHouse database. A rough description of the suggested pattern:

  1. Load incoming raw events from your application into a memory-based table Events with the Buffer storage engine
  2. Create a materialized view with per-minute aggregation in another memory-based table, EventsPerMinute, also with the Buffer engine
  3. Do the same for hourly aggregation in EventsPerHour
  4. Optionally, use Grafana with the ClickHouse datasource plugin to build dashboards

In ClickHouse, a Buffer-engine table that is not associated with any on-disk table is kept entirely in memory, and older data is automatically replaced by fresh data. This gives you simple housekeeping for the raw data.

The materialized views EventsPerMinute and EventsPerHour can also be created with the MergeTree storage engine in case you want to keep statistics on disk. ClickHouse can easily handle billions of records.
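A minimal sketch of this pattern in ClickHouse SQL, using the disk-backed MergeTree variant for the source table (table and column names are illustrative, not fixed by ClickHouse):

```sql
-- Raw events; swap MergeTree for the Buffer engine if you want
-- the memory-only variant described above.
CREATE TABLE Events
(
    ts      DateTime,
    address String
) ENGINE = MergeTree()
ORDER BY (address, ts);

-- Per-minute aggregation, maintained incrementally on every insert.
CREATE MATERIALIZED VIEW EventsPerMinute
ENGINE = SummingMergeTree()
ORDER BY (address, minute)
AS SELECT
    address,
    toStartOfMinute(ts) AS minute,
    count() AS cnt
FROM Events
GROUP BY address, minute;

-- Event count for one address over the last hour: a cheap scan of
-- at most 60 pre-aggregated rows instead of the raw events.
SELECT sum(cnt)
FROM EventsPerMinute
WHERE address = '1.2.3.4'
  AND minute >= now() - INTERVAL 1 HOUR;
```

An EventsPerHour view would look the same with toStartOfHour and would serve the daily window.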

At 100k events/second you may need some kind of shaper/load balancer in front of the database.

Upvotes: 1

binboavetonik

Reputation: 192

You could consider a Hazelcast cluster instead of plain RAM. Graylog or plain Elasticsearch might also work, but with this kind of load you should test them first. You can also rethink your data structure: construct an hour map for each address and put each event into its hour bucket. When an hour has passed, calculate the count and cache it in that hour's bucket. When you need minute granularity, go to the hour bucket and count the events it contains.
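(The bucket layout described here can be sketched as nested maps keyed by address, hour start, and minute start; a minimal illustration in Python, not Hazelcast-specific:)

```python
from collections import defaultdict

# events[address][hour_start][minute_start] -> count
events = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

def record(address, ts):
    hour = ts - ts % 3600     # start of the hour bucket
    minute = ts - ts % 60     # start of the minute bucket inside it
    events[address][hour][minute] += 1

def count_last_hour(address, now):
    total = 0
    for hour, minutes in events[address].items():
        for minute, n in minutes.items():
            if now - 3600 <= minute < now:  # minute bucket inside the window
                total += n
    return total

record("1.2.3.4", 7200)
record("1.2.3.4", 7260)
record("1.2.3.4", 3000)                    # older than one hour at t=7320
print(count_last_hour("1.2.3.4", 7320))    # → 2
```

In a Hazelcast deployment the outer map would become a distributed IMap; expired hour buckets can be dropped wholesale to bound memory.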

Upvotes: 0

Related Questions