Reputation: 15069
This is similar to another question that was asked, but there are key differences in my requirements. I need to store billions of rows, but they will only ever be searched per user_id, and any given user is unlikely to have more than 10 million rows of data. Given that I'm never searching across the entire dataset, do I even have to treat this as an unusual requirement?
There are hundreds of columns of Boolean and Float data that would be used to produce statistics. I can't rely on summary tables for these searches, since the criteria will be unpredictable.
Also, my data is sequential and will need to be accessed with real-time searches based on user_id and a time range (plus an ad hoc set of other conditions). Speed is much more important than reliability.
Is HBase/Hypertable a prime candidate given the sequential nature of the data and the large dataset? Again, would this even be considered a large dataset, given that I'm usually searching on a few million rows or fewer, and at most 10 million?
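For concreteness, here's a minimal sketch of the scan I have in mind, assuming an HBase row key of user_id plus timestamp (the host, table, and key format here are hypothetical), written with the happybase Python client:

```python
# A sketch of a user_id + time-range scan against HBase via happybase.
# Assumes rows are keyed "<user_id>#<timestamp>" so one user's data is
# contiguous on disk; host, table, and key format are hypothetical.
import happybase

connection = happybase.Connection('hbase-host')  # hypothetical host
table = connection.table('user_events')          # hypothetical table

def scan_user_range(user_id, start_ts, end_ts):
    # Row keys sort lexicographically, so zero-pad the timestamp to
    # keep the ordering consistent with numeric time.
    start = f'{user_id}#{start_ts:013d}'.encode()
    stop = f'{user_id}#{end_ts:013d}'.encode()
    # scan() streams only the contiguous slice between the two keys.
    for key, data in table.scan(row_start=start, row_stop=stop):
        yield key, data  # data maps b'family:qualifier' -> value
```

Because all of one user's keys sort together, the scan only ever touches that user's slice of the table, which seems to match my access pattern.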
Is Mongo not a good candidate because of the sequential nature of the data? I've read that, since Mongo stores its data using a B-tree, it's not a good candidate. I've also read that its map-reduce can't be parallelized, and so doesn't have great performance. If I have to use Hadoop anyway, is that another reason to just go with HBase?
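For comparison, the equivalent query shape in Mongo would look roughly like this (a pymongo sketch; the database, collection, and field names are hypothetical):

```python
# A sketch of the same query shape in MongoDB via pymongo. A compound
# index on (user_id, ts) lets the B-tree answer the user_id + time-range
# lookup directly; database, collection, and field names are hypothetical.
from pymongo import MongoClient, ASCENDING

client = MongoClient('mongodb://localhost:27017')
events = client.mydb.user_events

# One compound index covers the fixed part of every query.
events.create_index([('user_id', ASCENDING), ('ts', ASCENDING)])

cursor = events.find({
    'user_id': 42,
    'ts': {'$gte': 1609459200, '$lt': 1612137600},
    # the ad hoc Boolean/Float conditions would be appended here
})
for doc in cursor:
    pass  # compute statistics over the matching rows
```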
Is there another option that is best suited that I'm not considering?
Upvotes: 1
Views: 1852
Reputation: 2294
By your description of the volume of data you will be searching given a user_id and date range, I suspect you will spend the majority of your time waiting for disk access. My first thought is to optimize the hard disk subsystem.
As for the database, each of the systems you mention, as well as Oracle and SQL Server, could do a good job of moving the data from the hard disk to the application while performing some calculations along the way. The question I have for you is this: when you are standing in front of the president of the company after reporting a database error, are you going to say "I have posted a message to the user group and will wait until I hear back from someone," or "I have company X on the line and we are working to resolve the issue"?
Upvotes: 0
Reputation: 16174
Storing billions of rows generally becomes a problem because you run out of disk space on a single server, and partitioning non-trivial datasets can be difficult. You don't have this problem because, rather than one huge dataset, you have a thousand more reasonably sized ones.
I'd recommend using a data store that lets you create a completely separate table (or database) for each user. Although this is generally not considered a good idea when designing a SQL database, most of the schemaless stores can handle it reasonably well.
As well as allowing you to partition the data across servers easily (you probably don't need to parallelize search within a single user's dataset), this will eliminate the largest index entirely and keep the others to a reasonable size. A minimal sketch of what I mean follows below.
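Here's that sketch, using pymongo (the naming scheme and per-user index are assumptions for illustration, not a prescription):

```python
# A sketch of per-user partitioning with pymongo: one collection per
# user, so each index stays small and whole users can be relocated
# between servers. The naming scheme and routing are hypothetical.
from pymongo import MongoClient, ASCENDING

client = MongoClient('mongodb://localhost:27017')
db = client.stats

def user_collection(user_id):
    # MongoDB creates the collection lazily on first write.
    coll = db[f'user_{user_id}']
    coll.create_index([('ts', ASCENDING)])  # only a small per-user index
    return coll

def query_range(user_id, start_ts, end_ts, extra=None):
    criteria = {'ts': {'$gte': start_ts, '$lt': end_ts}}
    if extra:
        criteria.update(extra)  # the ad hoc conditions from the question
    return user_collection(user_id).find(criteria)
```

Since every query names exactly one user, routing to the right collection (or server) is a single lookup, and no index ever has to span more than one user's rows.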
Upvotes: 1