user1910316

Reputation: 519

Right database for machine learning on 100 TB of data

I need to perform classification and clustering on about 100 TB of web data, and I was planning on using Hadoop, Mahout, and AWS. What database do you recommend I use to store the data? Will MySQL work, or would something like MongoDB be significantly faster? Are there other advantages of one database over the other? Thanks.

Upvotes: 3

Views: 1546

Answers (1)

Joe K

Reputation: 18424

The simplest and most direct answer is to put the files in HDFS or S3 (since you mentioned AWS) and point Hadoop/Mahout straight at them. Other databases serve different purposes, but Hadoop/HDFS is designed for exactly this kind of high-volume, batch-style analytics. If you want a more database-style access layer, you can add Hive without too much trouble. The underlying storage would still be HDFS or S3, but Hive gives you SQL-like access to the data stored there, if that's what you're after.
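For concreteness, here's a minimal sketch of what "pointing Hadoop directly at the files" looks like: a bare-bones MapReduce driver whose input is an S3 path and whose output lands in HDFS. The bucket name, paths, and job name are placeholders, and the identity mapper is just a stand-in for whatever preprocessing (e.g. turning raw pages into vectors for Mahout) you'd actually run.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WebDataJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "web-data-prep");
        job.setJarByClass(WebDataJob.class);

        // Read the raw files straight from S3 -- no database in between.
        // "s3n://my-bucket/web-data/" is a placeholder path.
        FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/web-data/"));
        job.setInputFormatClass(TextInputFormat.class);

        // Write results to HDFS, where Mahout jobs can pick them up.
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/hadoop/prepped/"));

        // Identity mapper, map-only: a stand-in for real preprocessing logic.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Mahout's clustering and classification jobs take the same kind of HDFS/S3 paths as input, so there's no separate database to load the data into first.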

Just to address the two other options you brought up: MongoDB is good for low-latency reads and writes, but you probably don't need that here. As for MySQL, I'm not up on all its advanced features, but I'm guessing 100 TB is going to be pretty tough for it to handle, especially once you start running large queries that scan all of the data. It's designed more for traditional, transactional access.

Upvotes: 2
