Dmitry Sazhnev

Reputation: 383

Big Data time lapse queries

I have an access log file with 10 billion rows in it. Each row consists of a timestamp and a user cookie string. Let's assume for simplicity that each user has exactly one permanent cookie string. I need to build a system that can return the number of unique visitors for a given time lapse. The time lapse must be at least 1 day and at most 3 years. For example: the number of unique users from May 26 to September 10. I also have only 4 GB of RAM and effectively unlimited HDD. Any ideas on which DBMS would suit this best, and what schema design to use? I've never dealt with pieces of data this big.

Upvotes: 0

Views: 65

Answers (1)

Jens Roland

Reputation: 27770

A really great way to do this efficiently is using Redis' built-in bitmap commands (SETBIT, BITCOUNT, BITOP) or its SET type. Basically, you store an entry per day containing either a set of the unique identifiers seen that day (in the case of the SET implementation) or a bitmap in which each bit position represents a distinct cookie ID (note that these positions must be consistent over time, which gets tricky if you can't enumerate your IDs beforehand, as with cookie IDs that have a high churn rate).
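To illustrate the bitmap variant, here is a minimal sketch using plain Python ints as stand-in bitmaps (in Redis you would SETBIT on one key per day, BITOP OR the keys for the date range into a temporary key, then BITCOUNT it). The cookie-string-to-small-integer mapping is assumed to exist already; all names here are hypothetical:

```python
from datetime import date, timedelta

bitmaps = {}  # day -> int used as a bitmap (stand-in for a Redis key per day)

def record_visit(day, user_id):
    # Redis equivalent: SETBIT visits:<day> <user_id> 1
    bitmaps[day] = bitmaps.get(day, 0) | (1 << user_id)

def unique_visitors(start, end):
    # Redis equivalent: BITOP OR tmp visits:<d1> visits:<d2> ... ; BITCOUNT tmp
    merged = 0
    d = start
    while d <= end:
        merged |= bitmaps.get(d, 0)
        d += timedelta(days=1)
    return bin(merged).count("1")

record_visit(date(2015, 5, 26), 0)
record_visit(date(2015, 5, 26), 7)
record_visit(date(2015, 5, 27), 7)  # same user on a second day: still one bit
print(unique_visitors(date(2015, 5, 26), date(2015, 5, 28)))  # 2
```

The OR-merge is what makes arbitrary ranges cheap: a user who visited on several days within the range still occupies a single bit in the merged bitmap.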

There is a great article about this by Avichal Garg (@avichal) on GetSpool.com, demonstrating real-time query performance for this exact use case:

In a simulation of 128 million users, a typical metric such as “daily unique users” takes less than 50 ms on a MacBook Pro and only takes 16 MB of memory.

Note that this solution not only counts the uniques but can tell you exactly WHICH users they were: not a sampled or approximate HyperLogLog-style estimate, but a real, complete, and exact list of users.
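Recovering that exact list is just reading off the set bit positions of the merged bitmap. A hypothetical sketch, again with a Python int standing in for a Redis bitmap (which you would fetch with GET and scan byte by byte):

```python
def users_in(bitmap):
    # Return the user IDs (bit positions) set in the merged bitmap.
    user_ids = []
    pos = 0
    while bitmap:
        if bitmap & 1:
            user_ids.append(pos)
        bitmap >>= 1
        pos += 1
    return user_ids

print(users_in(0b10000001))  # [0, 7]
```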

I used the same method in production at a previous job, and I can verify their results.

Upvotes: 1
