Jeremy Smith

Reputation: 15069

Best data store for dealing with real time queries against billions of rows of sequential data?

This is similar to another question that was asked, but there are key differences in my requirements. I need to store billions of rows, but they will only be searched per user_id, and any given user is not likely to have more than 10 million rows of data. Given that I'm never searching across the entire dataset, do I even have to treat this as an unusual requirement?

There are hundreds of columns of Boolean and Float data that would be used to produce statistics; I can't rely on summary tables for these searches since the criteria will be unpredictable.

Also, my data is sequential, and will need to be accessed using real time searches based on user_id and a range of time (with an ad hoc set of other conditions). Speed is much more important than reliability.
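To make the access pattern concrete, here is a minimal sketch of the kind of query I mean, using SQLite with a made-up two-column schema (`flag_a`, `metric_b`) standing in for the hundreds of real stat columns:

```python
import sqlite3

# Hypothetical schema: hundreds of Boolean/Float stat columns, two shown here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        user_id  INTEGER,
        ts       INTEGER,   -- epoch seconds
        flag_a   INTEGER,   -- Boolean stored as 0/1
        metric_b REAL
    )
""")
conn.execute("CREATE INDEX idx_user_ts ON events (user_id, ts)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [(7, t, t % 2, t * 0.5) for t in range(100)],
)

# The access pattern described above: one user, a time range,
# plus an ad hoc set of conditions on the stat columns.
rows = conn.execute(
    """SELECT ts, metric_b FROM events
       WHERE user_id = ? AND ts BETWEEN ? AND ?
         AND flag_a = 1 AND metric_b > ?""",
    (7, 10, 50, 10.0),
).fetchall()
```

The `(user_id, ts)` index covers the fixed part of every query; the ad hoc conditions are then just a scan over one user's rows in the time window.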

Upvotes: 1

Views: 1852

Answers (2)

RC_Cleland

Reputation: 2294

By your description of the volume of data you will be searching given a user_id and date range, I suspect you will spend the majority of your time waiting for disk access. My first thought is to optimize the hard disk subsystem.

As for the database, each of the databases you mention, plus Oracle and SQL Server, could do a good job of passing the data from the hard disk to the application while performing some calculations along the way. The question I have for you is this: when you are standing in front of the president of the company after reporting a database error, are you going to say "I have posted a message with the user group and will wait until I hear back from someone" or "I have company X on the line and we are working to resolve the issue"?

Upvotes: 0

Tom Clarkson

Reputation: 16174

Storing billions of rows generally becomes a problem because you run out of disk space on a single server and partitioning non-trivial datasets can be difficult. You don't have this problem because rather than one huge dataset you have a thousand more reasonably sized datasets.

I'd recommend using a data store that lets you create a completely separate table (or database) for each user. Although this is generally not considered a good idea when designing a SQL database, most of the schemaless stores can handle it reasonably well.

As well as allowing you to partition the data across servers easily (you probably don't need to parallelize search within a single user dataset), this will eliminate the largest index entirely and keep the others to a reasonable size.
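A minimal sketch of the per-user partitioning idea, with one SQLite file per user standing in for a separate table or database in whatever store you pick (the `user_db` helper and file layout are illustrative, not a specific product's API):

```python
import os
import sqlite3
import tempfile

def user_db(base_dir, user_id):
    """Open (creating if needed) one database file per user.

    Because each file holds only one user's rows, no global user_id
    index exists at all; only a per-user time index is needed.
    """
    path = os.path.join(base_dir, f"user_{user_id}.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts INTEGER, metric REAL)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_ts ON events (ts)")
    return conn

base = tempfile.mkdtemp()
conn = user_db(base, 42)
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(t, t * 1.5) for t in range(1000)])
conn.commit()

# A time-range query now touches only this one user's small dataset.
hits = conn.execute(
    "SELECT COUNT(*) FROM events WHERE ts BETWEEN ? AND ?",
    (100, 199),
).fetchone()[0]
```

Distributing users across servers then becomes a matter of routing each user_id to the machine holding that user's file, rather than repartitioning one giant table.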

Upvotes: 1
