Ruben Verborgh

Reputation: 3665

Database storage with fast elementary queries and counts for 500M items

I'm looking for a storage solution that is (only) fast for elementary queries and counts on a large but simple data set; there are no other constraints.

The data consists of 500 million items shaped like the following:

{
    a: VARCHAR(100)
    b: VARCHAR(100)
    c: VARCHAR(100)
}

I want to solve the following elementary queries fast:
“give me page N of items where a = X AND b = Y AND c = Z”.
Any of the three constraints is optional; e.g., a = X alone is also a valid query.
So I only want to query on field equality within a single row.
(Other queries are not important. Joins will never be necessary.)
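The query shape above is just a conjunction of equality predicates plus LIMIT/OFFSET pagination. A minimal sketch, using Python's built-in sqlite3 as a stand-in for whatever engine ends up being chosen (the table name, page size, and sample rows are assumptions for illustration):

```python
import sqlite3

# In-memory stand-in; the real data set (500M rows) would live in a server database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (a TEXT, b TEXT, c TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [("x", "y", "z"), ("x", "y", "w"), ("x", "q", "z")])

def page(filters, page_no, page_size=2):
    """Return one page of items matching the given equality filters.

    filters: dict with any subset of the keys 'a', 'b', 'c'
    (only those fixed column names are ever interpolated into the SQL).
    """
    where = " AND ".join(f"{col} = ?" for col in filters) or "1=1"
    sql = (f"SELECT a, b, c FROM items WHERE {where} "
           f"ORDER BY rowid LIMIT ? OFFSET ?")
    params = list(filters.values()) + [page_size, page_no * page_size]
    return con.execute(sql, params).fetchall()

print(page({"a": "x"}, 0))             # first page of items with a = 'x'
print(page({"a": "x", "b": "y"}, 0))   # two constraints at once
```

Each of the seven possible filter combinations is served by the same statement shape, which is why a per-column index (or a columnar engine) covers all of them.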

Additionally, I want to have fast count estimates for each of those queries.
For instance, I'd like to know that there are approximately 2 million items where a = X.
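The count is the same WHERE clause fed to COUNT(*); note that MySQL can also return a cheap estimate without scanning, via the `rows` column of `EXPLAIN SELECT ...`, taken from index statistics. A sketch of the exact-count variant (self-contained, with assumed sample data):

```python
import sqlite3

# Tiny stand-in data set; a real deployment would hold 500M rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (a TEXT, b TEXT, c TEXT)")
con.execute("CREATE INDEX idx_a ON items (a)")  # makes the a = X count an index-only scan
con.executemany("INSERT INTO items VALUES (?, ?, ?)",
                [("x", "y", "z"), ("x", "y", "w"), ("p", "q", "r")])

def count(filters):
    """Exact count of rows matching the given equality filters (keys from a/b/c)."""
    where = " AND ".join(f"{col} = ?" for col in filters) or "1=1"
    sql = f"SELECT COUNT(*) FROM items WHERE {where}"
    return con.execute(sql, list(filters.values())).fetchone()[0]

print(count({"a": "x"}))  # exact count of items where a = 'x'
```

If an approximation is acceptable, the estimate route avoids touching millions of index entries per query.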

What storage solutions / configurations are the best match and why?

Update: it might also be good to know that the underlying data rarely changes and is added all at once in batches of ± 20 million items.



So far I have tried ingesting the data into a MySQL database, with an index on each of the columns. I got as far as 100 million items, and query performance is very good. However, the server has become extremely unstable, and the import process (using LOAD DATA) halts randomly.
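Since the data arrives in large, infrequent batches, one common mitigation is to load the rows first, inside a single transaction and with no secondary indexes defined, and build the indexes once afterwards; per-row index maintenance is typically what makes large imports crawl. A sketch of that pattern (sqlite3 stands in for the real engine; batch contents are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (a TEXT, b TEXT, c TEXT)")

# Simulate one batch; the real batches hold roughly 20M rows each.
batch = [(f"a{i % 5}", f"b{i % 7}", f"c{i % 11}") for i in range(1000)]

# Load first, in one transaction, while the table has no indexes...
with con:
    con.executemany("INSERT INTO items VALUES (?, ?, ?)", batch)

# ...then build each per-column index in a single pass over the loaded data.
for col in ("a", "b", "c"):
    con.execute(f"CREATE INDEX idx_{col} ON items ({col})")

print(con.execute("SELECT COUNT(*) FROM items").fetchone()[0])  # 1000
```

Whether dropping and rebuilding indexes around each LOAD DATA actually cures the instability depends on the engine and server configuration, but it removes the main per-row cost from the import path.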

Upvotes: 2

Views: 268

Answers (1)

Hannes Mühleisen

Reputation: 2552

How frequent are your values? For example, are all a, b and c entries distinct? If not, mapping them to integers would be a good idea to speed up queries.
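The mapping suggested here is essentially dictionary encoding: each distinct string gets a small integer id, so equality queries become integer comparisons and the stored column shrinks. A minimal sketch (the sample values are made up):

```python
def encode(values):
    """Dictionary-encode a column: return (codes, dictionary) for a list of strings."""
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # next unused integer id
        codes.append(dictionary[v])
    return codes, dictionary

a_column = ["red", "blue", "red", "red", "green"]
codes, dictionary = encode(a_column)
print(codes)       # [0, 1, 0, 0, 2]
print(dictionary)  # {'red': 0, 'blue': 1, 'green': 2}

# The query a = 'red' becomes one dictionary lookup plus integer comparisons.
matches = [i for i, c in enumerate(codes) if c == dictionary["red"]]
print(matches)     # [0, 2, 3]
```

Columnar engines apply this encoding automatically for low-cardinality string columns, which is part of why they suit this workload.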

Regardless, I suggest you try the free and open-source database MonetDB (http://www.monetdb.org/) (disclosure: I work for the research group that builds it); the columnar approach might work very well here. I recently used it for a similar use case with over 100 billion rows, and it worked great.

Upvotes: 1
