deamon

Reputation: 92509

Looking for big, complex sample data

I want to benchmark some (graph) databases and am looking for big, complex datasets. Each dataset should be between 2 TB and 5 TB in size. Do you know of any sample datasets (maybe open government or science data) that fulfill these criteria?

Upvotes: 0

Views: 136

Answers (1)

Rishi Dua

Reputation: 2334

These should fit your requirements:

  • The 1000 Genomes project makes 260 TB of human genome data available
  • The Internet Archive is making an 80 TB web crawl available for research
  • The TREC conference made the ClueWeb09 dataset available a few years back. You'll have to sign an agreement and pay a nontrivial fee (up to $610) to cover the sneakernet data transfer. The data is about 5 TB compressed.
  • ClueWeb12 is now available, as are the Freebase annotations, FACC1
  • CNetS at Indiana University makes a 2.5 TB click dataset available
  • ICWSM made a large corpus of blog posts available for their 2011 conference. You'll have to register (an actual form, not an online form), but it's free. It's about 2.1 TB compressed.
  • The Proteome Commons makes several large datasets available. The largest, the Personal Genome Project, is 1.1 TB in size.

There are several others over 100 GB in size.

Upvotes: 2
