Reputation: 613
This might be an interesting question to some:
Given: 2-3 terabytes of data stored in SQL Server (an RDBMS). Think of it as similar to Amazon's data, i.e., users -> what they viewed or clicked on -> what they bought.
Task: Build a recommendation engine (like Amazon's) that shows the user "customers who bought this also bought this" and "if you liked this, you might like this", plus some data mining to predict future buying habits. Basically, a reco engine.
Issue: Because of the sheer volume of data (5-6 years' worth of user habit data), I see Hadoop as the natural solution. The question is which combination of tools to use, i.e.,
HDFS: Underlying file system
HBASE/HIVE/PIG: ?
Mahout: For running the algorithms, which I assume use MapReduce (genetic, clustering, data mining, etc.)
- What am I missing? What about loading the RDBMS data for all this processing? (Sqoop for Hadoop?)
- At the end of all this, do I just get a list of results (recos), or is there a way to query them directly and report them to the front end I am building in .NET?
I think the answers to this question might make a useful discussion for people like me who want to kick-start their Hadoop experimentation in the future.
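To make the "customers who bought this also bought this" task concrete, here is a minimal single-machine sketch in Python of item co-occurrence counting. It is only an illustration of the idea, not Mahout's implementation; the function names and the basket data are hypothetical, and at 2-3 TB the same counting would have to run as a distributed job.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(baskets):
    """Count how often each pair of items appears in the same purchase."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # De-duplicate items so repeats within one order count only once.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_bought(counts, item, n=3):
    """Return up to n items most often bought together with `item`."""
    # .get() avoids creating an empty entry for items never seen.
    return [other for other, _ in counts.get(item, Counter()).most_common(n)]


# Hypothetical purchase history: one list of item names per order.
baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "lamp"],
    ["pen", "desk"],
]
counts = cooccurrence(baskets)
top_for_book = also_bought(counts, "book", n=1)
```

With this sample data, "lamp" shares three orders with "book" and "pen" only one, so "lamp" ranks first for "book".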
Upvotes: 2
Views: 611
Reputation: 544
HBase can fit your scenario. HDFS is the underlying file system. However, you cannot load data into HDFS in an arbitrary format and then query it through HBase; the data has to be in HBase's own file format (HFile).
HBase integrates with MapReduce, and Pig and Hive also integrate with HBase. As Chris mentioned, you can use Thrift to perform your queries (get, scan). Since these queries fetch the info for a specific user rather than a massive data set, Thrift is more suitable than MR here.
Upvotes: 0
Reputation: 51369
For loading data from the RDBMS, I'd recommend looking into BCP (to export from SQL Server to a flat file) and then the Hadoop command line for loading the file into HDFS. Sqoop is good for ongoing data, but it's going to be intolerably slow for your initial load.
To query results from Hadoop you can use HBase (assuming you want low-latency queries), which can be queried from C# via its Thrift API.
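As a rough sketch of the initial bulk load described above (BCP export, then the Hadoop command line), the Python helper below only assembles the two command lines and does not run them. The table, server, and path names are hypothetical placeholders, and it assumes Windows integrated authentication (`-T`) and BCP's character mode (`-c`), which writes tab-delimited text by default.

```python
def build_bulk_load_commands(table, server, local_file, hdfs_dir):
    """Build (but do not run) the export and upload commands.

    table      -- fully qualified SQL Server table, e.g. db.schema.table
    server     -- SQL Server instance name
    local_file -- flat file that bcp writes to
    hdfs_dir   -- target directory in HDFS
    """
    # bcp <table> out <file>: export the table to a flat file.
    # -c = character (text) mode, -S = server, -T = trusted connection.
    export_cmd = ["bcp", table, "out", local_file, "-c", "-S", server, "-T"]
    # hadoop fs -put <local> <hdfs>: copy the flat file into HDFS.
    load_cmd = ["hadoop", "fs", "-put", local_file, hdfs_dir]
    return export_cmd, load_cmd


# Hypothetical names, for illustration only.
export_cmd, load_cmd = build_bulk_load_commands(
    table="shop.dbo.purchases",
    server="SQLHOST01",
    local_file="purchases.tsv",
    hdfs_dir="/data/raw/purchases/",
)
```

Each list can be passed to `subprocess.run(..., check=True)` on a machine that has both `bcp` and the Hadoop client installed. For very large tables you would likely export in chunks rather than as one file.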
Upvotes: 1