Reputation: 3665
I'm looking for a storage solution that is (only) fast for elementary queries and counts on a large but simple data set; there are no other constraints.
The data consists of 500 million items shaped like the following:
{
    a: VARCHAR(100)
    b: VARCHAR(100)
    c: VARCHAR(100)
}
I want to solve the following elementary queries fast:
“give me page N of items where a = X AND b = Y AND c = Z”
Any of the three constraints is optional; e.g., a = X alone is also a valid query.
So I only want to query on field equality within a single row.
(Other queries are not important. Joins will never be necessary.)
Additionally, I want to have fast count estimates for each of those queries.
For instance, I'd like to know that there are approximately 2 million items where a = X.
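One common way to meet the count-estimate requirement without scanning all 500 million rows is to extrapolate from a uniform random sample. The sketch below is illustrative only; `estimate_count` is a hypothetical helper, not part of any particular database.

```python
# Sketch: if a uniform random sample of rows is kept on the side,
# approximate counts come from extrapolating sample hits.
# `estimate_count` is an illustrative name, not a library function.
def estimate_count(sample, total_rows, predicate):
    hits = sum(1 for row in sample if predicate(row))
    return round(hits / len(sample) * total_rows)

# Toy sample of 100 rows standing in for a 500-row table:
sample = [("x",)] * 30 + [("y",)] * 70
est = estimate_count(sample, total_rows=500, predicate=lambda r: r[0] == "x")
print(est)  # 150
```

The error of such an estimate shrinks with sample size, so for a 500-million-row table even a modest sample gives usable ballpark figures.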
What storage solutions / configurations are the best match and why?
Update: it might also be good to know that the underlying data rarely changes and is added all at once in batches of ± 20 million items, e.g., via LOAD DATA.
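To make the query pattern concrete: since every query is a conjunction of equality predicates on a, b, and c, a few composite indexes cover all combinations. The sketch below uses stdlib SQLite purely as a stand-in (it is not a recommendation for 500 million rows); the index names and toy data are invented for illustration.

```python
# Illustrative only: SQLite stands in for whatever store is chosen,
# to show the index layout the equality-query pattern needs.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (a TEXT, b TEXT, c TEXT)")
# A composite index on (a, b, c) serves queries on a, on a+b, and on
# a+b+c; extra indexes cover lookups that start with b or c alone.
con.execute("CREATE INDEX idx_abc ON items (a, b, c)")
con.execute("CREATE INDEX idx_bc  ON items (b, c)")
con.execute("CREATE INDEX idx_c   ON items (c)")

con.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [("x", "y", "z"), ("x", "y", "w"), ("x", "q", "z")])

# "Page 1" of items where a = 'x' AND b = 'y':
rows = con.execute(
    "SELECT a, b, c FROM items WHERE a = ? AND b = ? "
    "ORDER BY c LIMIT 10 OFFSET 0", ("x", "y")).fetchall()
print(rows)  # [('x', 'y', 'w'), ('x', 'y', 'z')]
```

Paging via LIMIT/OFFSET works the same way in most SQL stores, though deep offsets get slow; keyset pagination (WHERE c > last_seen) is the usual fix.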
Upvotes: 2
Views: 268
Reputation: 2552
How frequent are your terms? E.g., are all a, b, and c entries distinct? If not, mapping them to integers would perhaps be a good idea to make queries faster.
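The integer-mapping idea above is dictionary encoding: assign each distinct string a small integer once, then store and compare the integers. A minimal sketch, with an invented helper name:

```python
# Dictionary encoding sketch: map repeated strings to small integers.
# `encode_column` is an illustrative name for this demo.
def encode_column(values):
    codes, encoded = {}, []
    for v in values:
        # setdefault assigns the next unused code to unseen values
        encoded.append(codes.setdefault(v, len(codes)))
    return codes, encoded

codes, encoded = encode_column(["foo", "bar", "foo", "baz", "bar"])
print(encoded)     # [0, 1, 0, 2, 1]
print(len(codes))  # 3 distinct values
```

With low-cardinality columns this shrinks storage dramatically and turns string equality tests into integer comparisons; columnar engines like MonetDB apply this kind of encoding internally.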
Regardless, I suggest you try the free and open-source database MonetDB (http://www.monetdb.org/) (disclosure: I work for the research group that builds it); the columnar approach might work very well here. I recently used it for a similar use case with over 100 billion rows, and it works great.
Upvotes: 1