Reputation: 333
I'm investigating whether or not to cache large datasets using Redis.
The largest of the datasets holds approximately 5 million objects. Although each object has a unique identifier, they're never used individually by the client; aggregate and join operations are performed on the whole dataset.
The target environment is 4 servers, each with 144 GB RAM, 24 cores and gigabit network cards, running Windows Server 2008 R2 Enterprise. To that end I've installed 10 instances of Redis-64.2.6.12.1 from Microsoft Open Technologies on each box, and I'm using ServiceStack's Redis client.
I've sharded the data into chunks of 1000 objects (this seems to give the best performance) and used the ShardedRedisClientManager to hash each chunk id and distribute the data across the 40 caches. An object map is persisted so that the client application can retrieve all the objects using just the dataset id. Redis lists are used for both the objects and the object-map.
Transactions didn't improve the performance, but parallel processing did once I grouped the chunks by connection. However, the performance is still unsatisfactory: the best time to set and then get 5M objects plus the object-map is 268,055 ms.
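For reference, here's a stripped-down sketch of the pattern I'm using. The `DatasetChunk` type, the key scheme and the hash-to-connection mapping are simplified for illustration (the real code uses Redis lists via ShardedRedisClientManager, not a typed SET per chunk):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using ServiceStack.Redis;

public class DatasetChunk
{
    public string ChunkId { get; set; }
    public List<byte[]> Objects { get; set; }   // ~1000 serialized objects per chunk
}

public static class ChunkedCacheWriter
{
    // Chunks are grouped by the instance their id hashes to, then each group
    // is written in parallel over its own connection.
    public static void SetDataset(IList<IRedisClientsManager> managers,
                                  string datasetId, IList<DatasetChunk> chunks)
    {
        var groups = chunks.GroupBy(c => Math.Abs(c.ChunkId.GetHashCode()) % managers.Count);

        Parallel.ForEach(groups, group =>
        {
            using (var redis = managers[group.Key].GetClient())
            {
                foreach (var chunk in group)
                    redis.Set("chunk:" + chunk.ChunkId, chunk);   // typed SET of a whole chunk
            }
        });

        // Object map: the chunk ids that make up the dataset, so the client can
        // fetch everything knowing only the dataset id.
        using (var redis = managers[0].GetClient())
            redis.Set("dataset:" + datasetId, chunks.Select(c => c.ChunkId).ToList());
    }
}
```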
So, is there a better approach to caching large datasets using Redis? Is it even reasonable to cache such datasets? Or should I make do with serializing to disk and move the processing to the data, à la Hadoop?
Upvotes: 2
Views: 2974
Reputation: 143284
The question isn't whether Redis is suitable for large datasets; it's whether your dataset and use case are suitable for Redis.
Redis essentially lets you maintain distributed data-structure collections and access and interact with them in a thread-safe, atomic way, at the optimal Big-O performance each collection type allows.
Redis may be fast, but it's still limited by network latency and by your data storage and access patterns. You still need to be concerned with the number of network round-trips and the bandwidth required, whether your data access requires full-table scans or can be reduced via custom indexes, and the performance overhead of the serialization library you're using.
It seems odd to want to transfer the entire dataset each time, which may be an indication that you shouldn't be itemizing the dataset into server-side Redis collections. If you're only accessing and manipulating the dataset on the client, there's no real benefit in storing the data in Redis collections.
If your use case is "what's the fastest way I can get 5M objects hydrated into in-memory .NET data structures?", then the answer is to store the entire dataset as a blob in a single GET/SET entry using a fast binary format like ProtoBuf or MessagePack. That way Redis is only acting as fast in-memory blob storage. If access to the datastore doesn't need to be distributed (i.e. accessed over a network), then a fast embedded datastore like LevelDB would be a better fit.
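A rough sketch of that blob approach, using ServiceStack's IRedisNativeClient for raw byte[] access; the MessagePack-CSharp serializer and the key names below are just one possible choice, and T needs to be serializable by whatever binary format you pick:

```csharp
using System.Collections.Generic;
using MessagePack;
using ServiceStack.Redis;

public static class BlobCache
{
    public static void SaveDataset<T>(IRedisClientsManager manager, string datasetId, List<T> items)
    {
        // One serialization pass over the whole dataset into a single binary blob.
        byte[] blob = MessagePackSerializer.Serialize(items);

        using (var redis = manager.GetClient())
        {
            // RedisClient also implements IRedisNativeClient, which exposes raw byte[] SET/GET.
            ((IRedisNativeClient)redis).Set("dataset:" + datasetId, blob);
        }
    }

    public static List<T> LoadDataset<T>(IRedisClientsManager manager, string datasetId)
    {
        using (var redis = manager.GetClient())
        {
            byte[] blob = ((IRedisNativeClient)redis).Get("dataset:" + datasetId);
            return blob == null ? null : MessagePackSerializer.Deserialize<List<T>>(blob);
        }
    }
}
```

The whole round-trip then becomes one SET and one GET per dataset, so the cost is dominated by serialization and bandwidth rather than per-object round-trips.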
For maximum performance you could go further and use GETRANGE/SETRANGE to read chunks from multiple replicated Redis servers, or chunk the serialized binary blob across multiple sharded Redis servers - although this means the chunks are useless on their own, so a single corrupted chunk would invalidate the entire dataset.
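An illustrative sketch of the chunk-across-shards variant; the 1 MB slice size, key scheme and round-robin shard mapping are assumptions for the example, not a recommendation:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using ServiceStack.Redis;

public static class BlobSharder
{
    const int ChunkSize = 1024 * 1024;   // 1 MB per slice - tune to your payload and network

    public static void Save(IList<IRedisClientsManager> shards, string datasetId, byte[] blob)
    {
        int chunkCount = (blob.Length + ChunkSize - 1) / ChunkSize;
        for (int i = 0; i < chunkCount; i++)
        {
            int offset = i * ChunkSize;
            int length = Math.Min(ChunkSize, blob.Length - offset);
            var slice = new byte[length];
            Buffer.BlockCopy(blob, offset, slice, 0, length);

            using (var redis = shards[i % shards.Count].GetClient())
                ((IRedisNativeClient)redis).Set(datasetId + ":chunk:" + i, slice);
        }

        // The chunk count is needed later to reassemble the blob.
        using (var redis = shards[0].GetClient())
            redis.Set(datasetId + ":chunkCount", chunkCount);
    }

    public static byte[] Load(IList<IRedisClientsManager> shards, string datasetId)
    {
        int chunkCount;
        using (var redis = shards[0].GetClient())
            chunkCount = redis.Get<int>(datasetId + ":chunkCount");

        using (var ms = new MemoryStream())
        {
            for (int i = 0; i < chunkCount; i++)
            {
                using (var redis = shards[i % shards.Count].GetClient())
                {
                    // Every slice must come back intact; one missing or corrupt
                    // chunk makes the whole dataset unrecoverable.
                    var part = ((IRedisNativeClient)redis).Get(datasetId + ":chunk:" + i);
                    ms.Write(part, 0, part.Length);
                }
            }
            return ms.ToArray();
        }
    }
}
```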
Upvotes: 1