Reputation: 1254
I'm looking at building out the architecture for the following, and wanted to see what others think about it.
Assume that the system is running some non-trivial algorithm (so it's not simply a sum or the like) on the data collected for each user. Some users will have 10 rows of data, some will have tens of thousands. The data is user geo positions over time. There would be upwards of 10-100M users, and data for many users comes in every day, potentially every minute for some.
At periodic intervals (1/5/15 minutes, basically as soon as possible), I'd want to run that non-trivial algorithm on each user's data, which would spit out a couple of numbers that would then be reported out.
One way to model that would be to store the data in a NoSQL DB and process each user's data on an Akka cluster. Any recommendations for the DB?
The user data here is basically an append-only log: once added, data won't change, but it keeps growing all the time, and some users have disproportionately more data than others. To process the data per user, all of it needs to be loaded into memory somewhere. The best possible scenario is that all data sits in memory and is re-processed at a one-minute interval; the downside is that I would need terabytes of RAM to do that, and if the in-memory servers go down, all of the data would need to be reloaded, which would take a while.
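To make the access pattern concrete, here's a minimal sketch of that all-in-memory baseline; GeoPoint, runAlgorithm and report are just placeholders for illustration, not a real API:

```java
import java.util.Collection;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy sketch of the all-in-memory baseline: an append-only position log per
// user, fully re-processed on a fixed schedule. GeoPoint, runAlgorithm and
// report are placeholder names, not a real API.
public class PerUserReprocessor {

    record GeoPoint(double lat, double lon, long epochMillis) {}

    private final Map<String, Queue<GeoPoint>> logsByUser = new ConcurrentHashMap<>();

    // Appends never mutate existing rows; each user's history only grows.
    public void append(String userId, GeoPoint p) {
        logsByUser.computeIfAbsent(userId, id -> new ConcurrentLinkedQueue<>()).add(p);
    }

    // Every interval, run the (non-trivial) algorithm over each user's full history.
    public void start(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(
            () -> logsByUser.forEach((userId, points) -> report(userId, runAlgorithm(points))),
            1, 1, TimeUnit.MINUTES);
    }

    private double[] runAlgorithm(Collection<GeoPoint> points) {
        return new double[] { points.size() }; // stand-in for the real algorithm
    }

    private void report(String userId, double[] numbers) { /* push the couple of numbers downstream */ }

    public static void main(String[] args) {
        new PerUserReprocessor().start(Executors.newSingleThreadScheduledExecutor());
    }
}
```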
Upvotes: 2
Views: 313
Reputation: 7744
I'm currently working on a similar problem. My system has around 35,000 million "records", each record with about 4-5 values in it. I'm currently able to process them (non-trivial processing) in about 20 hours on a single mid-range desktop (6-core AMD, with spinning platters).
For storage, I tried almost everything, starting with Postgres and moving on to Cassandra and Hypertable. Then I realized that my use case only involved replaying the data in sequence, with no need for random access in either writes or reads. I found Chronicle, which is exactly what I was looking for. Since I didn't have enough RAM to store all the data, I needed to read everything from disk, and with Chronicle I got up to about 800,000 records/second.
I don't know the current version of Chronicle, but the version I used created an "index" file which I found superfluous. Since then I've been using my own code, which is basically Chronicle (memory-mapped files) without the index file; that gets me up to 1,300,000 records/second on my pretty average 30 MB/sec spinning disk.
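For reference, the sequential replay over a memory-mapped file looks roughly like the sketch below. The 24-byte record layout (user id plus two doubles) and the assumption that the file holds whole records are made up for the example, not my actual format:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of "memory-mapped file, no index": replay fixed-size records
// sequentially from disk. The record layout is an assumed example.
public class MappedReplay {

    static final int RECORD_SIZE = 24; // long userId + double lat + double lon (assumed layout)
    // A single mapping is capped at ~2 GB, so map the file in record-aligned chunks.
    static final long MAX_CHUNK = Integer.MAX_VALUE - (Integer.MAX_VALUE % RECORD_SIZE);

    public static void replay(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long pos = 0, size = ch.size(); // assumes the file is a whole number of records
            while (pos < size) {
                long chunk = Math.min(size - pos, MAX_CHUNK);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, chunk);
                while (buf.remaining() >= RECORD_SIZE) {
                    long userId = buf.getLong();
                    double lat = buf.getDouble();
                    double lon = buf.getDouble();
                    process(userId, lat, lon);
                }
                pos += chunk;
            }
        }
    }

    static void process(long userId, double lat, double lon) { /* feed the per-user algorithm */ }
}
```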
Another point for storage is to compress your data. It makes a huge difference. I wrote a compressor for my data that is bit-aligned (when I compress a value to 3 bits, it really writes just 3 bits, not 8). I found compressing on byte boundaries to be 30-40% worse (on my data). I would expect, for example, that GPS data from one person does not change rapidly, so each consecutive data point might only need a couple of bits.
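To show what I mean by bit-aligned, a minimal bit writer looks something like the code below: delta-encode the fixed-point lat/lon first, then write each delta in only as many bits as it needs. This is a sketch of the idea, not my actual codec:

```java
import java.io.ByteArrayOutputStream;

// Sketch of bit-aligned output: values are packed back to back at the bit
// level instead of being padded to whole bytes.
public class BitWriter {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int current; // bits accumulated but not yet flushed to a byte
    private int filled;  // how many bits of 'current' are in use

    // Write the lowest 'bits' bits of 'value', most significant bit first.
    public void write(long value, int bits) {
        for (int i = bits - 1; i >= 0; i--) {
            current = (current << 1) | (int) ((value >>> i) & 1L);
            if (++filled == 8) {
                out.write(current);
                current = 0;
                filled = 0;
            }
        }
    }

    public byte[] toByteArray() {
        if (filled > 0) {
            out.write(current << (8 - filled)); // pad the final partial byte with zeros
            current = 0;
            filled = 0;
        }
        return out.toByteArray();
    }
}
```

Usage-wise, if consecutive fixed-point positions differ by, say, -3 to +3 units, each delta fits in 3 bits per coordinate, which is where the savings over byte-aligned formats come from.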
Since I didn't need real-time processing like you do, my primary objective was as much processing performance as possible on one (or at least just a few) machines. I tried Akka and Hadoop (which is just a PITA, wouldn't recommend it), and played around with Apache Spark. My problem was that most of these are made to run on a large cluster and are not as fast as I want on a single machine (or at least, I couldn't make them as fast as I wanted).
I ended up just implementing a processing chain myself, which, as I said, comes out to about 500,000 records/second of processing including I/O. Since my data is easily split into independent shards, I can scale out without having to coordinate the nodes anyway.
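The sharding itself can stay very simple, something along these lines, where hashing the user id onto a fixed shard means the workers never have to coordinate. Record and process are placeholders, not my actual pipeline:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Rough sketch of the "independent shards" idea: route every record for a given
// user to the same worker, so shards never coordinate with each other.
public class ShardedPipeline {

    record Record(long userId, double lat, double lon) {}

    private final BlockingQueue<Record>[] queues;

    @SuppressWarnings("unchecked")
    public ShardedPipeline(int shards) {
        queues = new BlockingQueue[shards];
        for (int i = 0; i < shards; i++) {
            BlockingQueue<Record> q = new ArrayBlockingQueue<>(65_536);
            queues[i] = q;
            Thread worker = new Thread(() -> drain(q), "shard-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }

    // All records for one user land on the same shard, so per-user state stays local.
    public void submit(Record r) throws InterruptedException {
        int shard = (int) Math.floorMod(r.userId(), (long) queues.length);
        queues[shard].put(r);
    }

    private void drain(BlockingQueue<Record> q) {
        try {
            while (true) {
                process(q.take());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void process(Record r) { /* per-shard processing step goes here */ }
}
```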
If you have much more data and need real-time processing, you will probably have to scale out a lot more than I did, and then single-machine performance might not be the most important part.
Anyway, I hope some of this helps.
Upvotes: 2