Reputation: 1791
I have written an application that essentially sniffs an Ethernet device and studies certain patterns. I am using Python and Scapy to capture the data. Since the data needs to be stored in a database for posterity and for pattern studies, we are considering the following strategy.
1) We want to use a high-performance key-value store to capture the basic data. This would fundamentally be a key:value store with around 200 keys. 2) Every hour we will poll the key store and, based on certain conditions and patterns, fill a Postgres database from the values stored in the K:V store.
We are planning to use Redis for the K:V store. We considered other solutions, including databases and file-based caches, but they have performance bottlenecks. For one, several thousand packets get processed every minute, and SQL calls back and forth to a database slow the program down.
I have never used Redis, but I am told it's the fastest and most efficient K:V NoSQL data store, and the Redis Python API makes it very Pythonic. Essentially, the store would have 200-odd keys; about 80% of them would hold long integer values, and the rest would hold character values of fewer than 200 characters.
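For concreteness, here is a minimal sketch of the capture path we have in mind, assuming Redis runs locally and using placeholder key names (the real application tracks around 200 keys):

    import redis
    from scapy.all import IP, TCP, UDP, sniff

    r = redis.Redis(host="localhost", port=6379, db=0)

    def handle_packet(pkt):
        # Keep only counters in Redis; INCR/INCRBY are atomic, so several
        # capture processes could update the same keys safely.
        if IP in pkt:
            r.incrby("pkt:ip_bytes", len(pkt))
            if TCP in pkt:
                r.incr("pkt:tcp")
            elif UDP in pkt:
                r.incr("pkt:udp")

    # store=0 keeps Scapy from holding packets in memory; sniffing needs root.
    sniff(prn=handle_packet, store=0)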
Questions
1) Is this the right approach?
2) What are the other parameters to consider?
3) How much would the memory scale? What should I do to ensure that the memory size is optimized for faster performance?
4) How do I calculate memory sizes?
Python is the only language we know. So any suggestion like C/C++ may not appeal.
We are OK with a few packets being lost once in a while, because the idea is to study patterns rather than to produce absolutely accurate results. The number of keys would remain the same, and their values can go up and down.
We need the final calculated data to be stored in an RDBMS, because the future manipulations are SQL-intensive.
Upvotes: 3
Views: 2412
Reputation: 73206
1) Is this the right approach?
Well, it can certainly be implemented like this, and Redis is fast enough to sustain this kind of workload. Your bottleneck will be your Python code, more than Redis itself.
2) What are the other parameters to consider?
You may want to accumulate your data in memory (in a dictionary) rather than in Redis. Unless you configure Redis with full-fsync AOF persistence (which makes it slow), Redis is not much more resilient to a system crash than your Python process.
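As a rough illustration of the in-memory alternative (a single capture process, placeholder key names, and the one-hour interval from the question; flushing to PostgreSQL is left as a stub):

    import time
    from collections import defaultdict

    counters = defaultdict(int)
    last_flush = time.time()
    FLUSH_INTERVAL = 3600  # one hour, as described in the question

    def flush(snapshot):
        # Replace with the actual conditional logic and INSERTs into PostgreSQL.
        print(snapshot)

    def handle_packet(pkt):
        global last_flush
        counters["pkt:total"] += 1         # illustrative key names
        counters["pkt:bytes"] += len(pkt)
        now = time.time()
        if now - last_flush >= FLUSH_INTERVAL:
            flush(dict(counters))
            counters.clear()
            last_flush = now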
However, if you have several capture processes and need to aggregate the data before storing it in PostgreSQL, then Redis is a very good solution.
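A sketch of what the hourly job could look like in that case, assuming redis-py and psycopg2 (any PostgreSQL driver would do); the key list, table name, and column names are only illustrative, and the keys are assumed to be counters:

    import psycopg2
    import redis

    KEYS = ["pkt:tcp", "pkt:udp", "pkt:ip_bytes"]  # your ~200 known keys

    r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)
    pg = psycopg2.connect(dbname="traffic", user="capture")

    def flush_to_postgres():
        values = r.mget(KEYS)  # one round trip for all the keys
        with pg, pg.cursor() as cur:
            for key, value in zip(KEYS, values):
                cur.execute(
                    "INSERT INTO hourly_stats (key, value) VALUES (%s, %s)",
                    (key, int(value or 0)),
                )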
3) How much would the memory scale? What should I do to ensure that the memory size is optimized for faster performance?
If you only have 200 values, then memory consumption is a non-issue (it will be negligible). Redis is already fast enough for this kind of workload; you don't need any fancy tricks here.
However, you should maintain a list of your keys (so you can access them without relying on the KEYS command) and use pipelining to retrieve your data efficiently (i.e. not key by key). If you keep multiple keys, consider using the SORT command to fetch everything in one shot, or define a single hash object to hold your 200 keys/values and retrieve them all at once.
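For example, with a single hash the whole snapshot comes back in one HGETALL, and with separate string keys a pipeline batches the round trips (the key and hash names below are illustrative):

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

    # Capture side: HINCRBY updates one field of the hash atomically.
    r.hincrby("capture:counters", "pkt:tcp", 1)
    r.hincrby("capture:counters", "pkt:ip_bytes", 1500)

    # Hourly poll: a single round trip returns every field and value.
    snapshot = r.hgetall("capture:counters")   # {'pkt:tcp': '1', ...}

    # Alternative with separate string keys: pipeline the GETs.
    pipe = r.pipeline()
    for key in ["pkt:tcp", "pkt:udp"]:          # your known key list
        pipe.get(key)
    values = pipe.execute()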
4) How do I calculate memory sizes?
There is little point here, given how small the dataset is. But if you really have to, start a Redis instance, load your data, and use the INFO command to get memory statistics. You can also dump the data and use the following Python tool to get statistics from the dump file: https://github.com/sripathikrishnan/redis-rdb-tools
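For instance, once you have loaded representative data into a test instance, the memory section of INFO is directly accessible from redis-py:

    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)
    mem = r.info("memory")
    print(mem["used_memory"], "bytes used (", mem["used_memory_human"], ")")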
Upvotes: 4