David542

Reputation: 110382

Trillion-row public dataset?

I am running a few benchmarks on a database, and I was wondering whether there are any publicly available datasets that contain over 1T rows?

I know Google BigQuery has a few publicly available datasets in the 100M+ range (wikipedia, gdelt-events) and the 1B+ range (nyc-tlc), but I couldn't find anything larger. Does anyone know of a 1T-row dataset that can be downloaded?

Upvotes: 0

Views: 1105

Answers (2)

Graham Polley

Reputation: 14791

There are the Wikipedia benchmark tables, which include the biggest public table I've seen: the largest is 106B rows (6.76TB). If you really wanted a trillion rows, you could simply run ~10 copy-append jobs!

https://bigquery.cloud.google.com/table/bigquery-samples:wikipedia_benchmark.Wiki100B?tab=details
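The arithmetic behind "~10 copy-append jobs" can be sketched quickly (assuming each job appends one full copy of the 106B-row source table into the destination):

```python
import math

SOURCE_ROWS = 106_000_000_000      # Wiki100B table row count
TARGET_ROWS = 1_000_000_000_000    # one trillion rows

# Each copy-append job adds another full copy of the source table,
# so the destination needs ceil(target / source) copies in total.
copies_needed = math.ceil(TARGET_ROWS / SOURCE_ROWS)
print(copies_needed)  # -> 10

final_rows = copies_needed * SOURCE_ROWS
print(final_rows)  # -> 1060000000000, i.e. ~1.06T rows (~67.6 TB)
```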

Upvotes: 3

NikoNyrh

Reputation: 4138

Wouldn't it be easier to just generate the dataset? Admittedly, the question remains how realistic its value distributions and correlations are, and how big an impact that has on the measured performance.
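A chunked generator makes this concrete; a minimal sketch, with made-up column names and distributions purely for illustration:

```python
import csv
import random

def generate_rows(n_rows, seed=42):
    """Yield synthetic (user_id, event_type, value) rows lazily,
    so memory use stays constant regardless of n_rows."""
    rng = random.Random(seed)
    event_types = ["view", "click", "purchase"]
    for _ in range(n_rows):
        yield (rng.randrange(1_000_000),           # user_id, uniform
               rng.choice(event_types),            # event_type, uniform
               round(rng.expovariate(1 / 50), 2))  # value, skewed like real amounts

def write_chunk(path, n_rows):
    """Write one CSV chunk; for 1T rows you would run many chunks in
    parallel and bulk-load them into the target database."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["user_id", "event_type", "value"])
        writer.writerows(generate_rows(n_rows))

write_chunk("chunk_000.csv", 100_000)
```

The caveat above still applies: uniform and exponential draws like these will compress and index very differently from real-world data.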

And if you can assume the cluster scales linearly, you could just benchmark with 5% of the data on 5% of the number of nodes you expect the production cluster to have. Regardless of the dataset size, you then choose the number of nodes so that they can handle the required number of requests per minute.
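Under that linear-scaling assumption, the sizing math is a one-liner (the benchmark numbers below are hypothetical):

```python
import math

def required_nodes(target_rpm, measured_rpm, benchmark_nodes):
    """Extrapolate production cluster size from a small benchmark,
    assuming throughput scales linearly with node count."""
    per_node_rpm = measured_rpm / benchmark_nodes
    return math.ceil(target_rpm / per_node_rpm)

# e.g. a 5-node benchmark on 5% of the data sustained 12,000 req/min,
# and production needs 240,000 req/min:
print(required_nodes(target_rpm=240_000, measured_rpm=12_000, benchmark_nodes=5))
# -> 100
```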

Taking a backup of a database that size must be quite an interesting problem, especially if it is constantly being updated.

Upvotes: 1
