dsp_099

Reputation: 6121

How to deal with a large amount of logs in Redis?

Say I have about 150 requests coming in every second to an API (Node.js), which are then logged in Redis. At that rate, a moderately priced RedisToGo instance will fill up every hour or so.

The logs are only needed to generate daily/monthly/annual statistics: which was the top requested keyword, which was the top requested URL, the total number of requests per day, etc. No super heavy calculations, but a somewhat time-consuming run through arrays to see which is the most frequent element in each.

If I analyze and then dump this data (with a setInterval function in Node, maybe?), say, every 30 minutes, it doesn't seem like such a big deal. But what if, all of a sudden, I have to deal with, say, 2500 requests per second?

All of a sudden I'm dealing with ~4.5 GB of data per hour. About 2.25 GB every 30 minutes. Even with how fast Redis/Node are, it'd still take a minute to calculate the most frequent requests.
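For scale, those figures work out as follows if you assume an average log entry of roughly 500 bytes (just an estimate, since I haven't measured the actual entry size):

2500 req/s * 3600 s/hour * ~500 bytes/entry ≈ 4.5 GB/hour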

Questions: What will happen to the Redis instance while 2.25 GB worth of data is being processed? (from a list, I imagine)

Is there a better way to deal with potentially large amounts of log data than moving it to Redis and then flushing it out periodically?

Upvotes: 2

Views: 2381

Answers (2)

Didier Spezia

Reputation: 73246

IMO, you should not use Redis as a buffer to store your log lines and process them in batch afterwards. It does not really make sense to consume memory for this. You would be better served by collecting your logs on a single server and writing them to a filesystem.

Now, what you can do with Redis is calculate your statistics in real time. This is where Redis really shines. Instead of keeping the raw data in Redis (to be processed in batch later), you can directly store and aggregate the statistics you need.

For instance, for each log line, you could pipeline the following commands to Redis:

zincrby day:top:keyword 1 my_keyword
zincrby day:top:url 1 my_url
incr day:nb_req

This will calculate the top keywords, top URLs, and the number of requests for the current day.
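As a rough sketch of what this could look like on the Node.js side (assuming the ioredis client; the client choice and the logRequest helper are illustrative, not something your setup requires), each request handler would simply pipeline the three commands:

const Redis = require("ioredis");
const redis = new Redis(); // assumes a local/default Redis instance

// Called once per incoming request; all three commands go out in one round trip.
function logRequest(keyword, url) {
  return redis
    .pipeline()
    .zincrby("day:top:keyword", 1, keyword)
    .zincrby("day:top:url", 1, url)
    .incr("day:nb_req")
    .exec();
}

At the end of the day, the daily keys can be snapshotted and rolled up: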

# Save data and reset counters (atomically)
multi
rename day:top:keyword tmp:top:keyword
rename day:top:url tmp:top:url
rename day:nb_req tmp:nb_req
exec

# Keep only the 100 top keyword and url of the day
zremrangebyrank tmp:top:keyword 0 -101
zremrangebyrank tmp:top:url 0 -101

# Aggregate monthly statistics for keyword
multi    
rename month:top:keyword tmp
zunionstore month:top:keyword 2 tmp tmp:top:keyword
del tmp tmp:top:keyword
exec

# Aggregate monthly statistics for url
multi    
rename month:top:url tmp
zunionstore month:top:url 2 tmp tmp:top:url
del tmp tmp:top:url
exec

# Aggregate number of requests of the month
get tmp:nb_req
incrby month:nb_req <result of the previous command>
del tmp:nb_req

At the end of the month, the process is completely similar (using zunionstore or get/incrby on the monthly data to aggregate the yearly data).

The main benefit of this approach is that the number of operations performed per log line stays small, while the monthly and yearly aggregations can still be calculated easily.

Upvotes: 6

ali haider

Reputation: 20202

How about using Flume or Chukwa (or perhaps even Scribe) to move the log data to a different server (if one is available)? You could then store the log data using Hadoop/HBase or any other disk-based store.

https://cwiki.apache.org/FLUME/

http://incubator.apache.org/chukwa/

https://github.com/facebook/scribe/

Upvotes: 1
