Reputation: 22395
I have a problem to solve and was wondering whether I am right to use something like Hadoop to distribute the work across multiple nodes, or whether I should use something else.
The Problem:
I have a very large database table with potentially a huge number of records, and each record has associated metadata fields (represented as columns on the table) with values. What I want to achieve is:
Given certain criteria, such as a search for records with metadata field X having value Y, I want to retrieve some records, but more importantly I want to make smart suggestions to the user about what to search for next, so they can perhaps find interesting records they were not aware of. The way I plan to do this is to check the metadata fields and values of all the matching records and present interesting choices for the user to keep filtering by (how to determine what is interesting is irrelevant to this question).
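To make the plan concrete, here is roughly what I have in mind as the naive single-query version (a minimal sketch; the table name `records` and the `metadata_*` column names are just placeholders):

```java
import java.sql.*;

// Minimal sketch of the naive single-query approach described above.
// Table name "records" and column names are hypothetical placeholders.
public class SuggestionScan {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "pass")) {
            // Count the distinct values of another metadata column among
            // the records matching the initial filter (field X = value Y).
            String sql = "SELECT metadata_z, COUNT(*) AS cnt "
                       + "FROM records WHERE metadata_x = ? "
                       + "GROUP BY metadata_z ORDER BY cnt DESC LIMIT 20";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "Y");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // Each row is a candidate "search for this next" suggestion.
                        System.out.println(rs.getString("metadata_z")
                                + " (" + rs.getLong("cnt") + " records)");
                    }
                }
            }
        }
    }
}
```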
Now if my table has a very large number of records, and the initial "filter" is matched by many of them, then retrieving all the matching records and scanning their other columns for suggestions can take a very long time, whether it is done in a single query or in iterated queries that incrementally fetch more and more records.
I was thinking this could be solved by distributing the search over different records to multiple nodes. My question is: should I be looking into something like Hadoop for this (distributing the load), or can someone suggest another way to accomplish the task?
Thanks
Daniel
Upvotes: 0
Views: 496
Reputation: 585
If you want a truly real-time (around 200 ms) response for your search application, for both the initial search query and the follow-up suggested searches, Hadoop is not a good choice; neither is Hive or HBase, and not even Impala (or Apache Drill, a Google Dremel-like system).
Hadoop is a batch processing system that is good for write-once, read-many-times use cases, and its strengths are scalability and I/O throughput. The trend I have seen is that many organizations use it as a data warehouse for offline data mining and BI analysis, as a replacement for data warehouses based on relational databases.
Hive and HBase both provide extra features on top of Hadoop, but neither of them can realistically reach 200 ms real time for an average complex query workload.
Check the Apache Drill homepage for a high-level view of how "real-time" each possible solution can actually get. Cloudera Impala and Apache Drill, which borrow ideas from Google Dremel, aim to improve query speed on top of Hadoop through query optimization, column-based storage, and massive I/O parallelism. I believe these two are still at an early stage of achieving the goals they claim; I did find some initial performance benchmarking results for Impala.
If you decide to go with Hadoop or a related solution stack, there are ways to load data from MySQL into Hadoop, for example using Sqoop or customized data-loading applications built on the Hadoop Distributed File System API. But if new data will keep arriving in MySQL from time to time, you will need to schedule a job that runs periodically to do a delta load from MySQL into Hadoop.
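For example, such a delta load could be kicked off from Java through Sqoop 1's `runTool` entry point; this is only a sketch, and the connection string, credentials, column, and table name below are placeholders (the same flags work on the `sqoop import` command line):

```java
import org.apache.sqoop.Sqoop;

// Sketch: invoking a Sqoop 1 import from Java via its runTool entry point.
// Connection string, credentials, and table/column names are placeholders.
public class MySqlToHdfsImport {
    public static void main(String[] args) {
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbhost:3306/mydb",
            "--username", "user",
            "--password", "pass",
            "--table", "records",
            // For periodic delta loads, an incremental import keyed on an
            // ever-increasing column avoids re-copying the whole table.
            "--incremental", "append",
            "--check-column", "id",
            "--last-value", "0",
            "--target-dir", "/data/records"
        };
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```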
On the other hand, standing up a Hadoop cluster and finding or building a suitable MySQL-to-Hadoop data-loading tool might be a huge amount of work. You would also need to find a suitable extra layer for runtime data access and build code around it, no matter whether that is Impala or something else. To solve your particular problem, it is probably better to build your own customized solution: for example, an in-memory cache for the hot records and their metadata from your database, along with some index mechanism to quickly locate the data you need for the suggested-search calculation. If the memory on one machine cannot hold enough records, a memory cache grid or cluster component such as Memcached, Redis, or EhCache comes in handy.
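As a rough illustration of that idea, here is a minimal self-contained sketch of a (field, value) -> record-id index over cached records, without any external cache library; the record and field names are made up:

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of the "hot records in memory plus an index" idea.
// A (field, value) -> record-id index makes both the initial filter
// and the suggestion counts cheap in-memory lookups.
public class MetadataIndex {
    // Maps "field=value" -> ids of records carrying that metadata value.
    private final Map<String, Set<Long>> index = new ConcurrentHashMap<>();
    // Hot records cached by id; each record is a map of metadata fields.
    private final Map<Long, Map<String, String>> records = new ConcurrentHashMap<>();

    public void add(long id, Map<String, String> metadata) {
        records.put(id, metadata);
        metadata.forEach((field, value) ->
            index.computeIfAbsent(field + "=" + value,
                k -> ConcurrentHashMap.newKeySet()).add(id));
    }

    // Count values of `suggestField` among records matching field=value;
    // the counts are the raw material for "search for this next" hints.
    public Map<String, Integer> suggestions(String field, String value,
                                            String suggestField) {
        Map<String, Integer> counts = new HashMap<>();
        for (long id : index.getOrDefault(field + "=" + value, Set.of())) {
            String v = records.get(id).get(suggestField);
            if (v != null) counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        MetadataIndex idx = new MetadataIndex();
        idx.add(1L, Map.of("genre", "jazz", "decade", "1960s"));
        idx.add(2L, Map.of("genre", "jazz", "decade", "1970s"));
        System.out.println(idx.suggestions("genre", "jazz", "decade"));
    }
}
```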
Upvotes: 1
Reputation: 34184
IMHO, Hadoop all by itself won't be able to solve your problem. First of all, Hadoop (HDFS, to be precise) is a filesystem and doesn't provide columnar storage in which you can query for a particular field. Data inside HDFS is stored as flat files, and you have to traverse the data in order to reach the point where the data of interest resides.
Having said that, there are some workarounds, like making use of Hive. Hive is another member of the Hadoop family, which provides warehousing capability on top of your existing Hadoop cluster. It allows you to map HDFS files as Hive tables, which can be queried conveniently through a SQL-like interface. But Hive is not a good fit if you have real-time needs.
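For instance, assuming a running HiveServer2 and the hive-jdbc driver on the classpath, a Hive table can be queried from Java like this (a sketch only; host, table, and column names are placeholders):

```java
import java.sql.*;

// Sketch: querying a Hive table over JDBC, assuming a HiveServer2
// instance is running and the hive-jdbc driver is on the classpath.
public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // The same kind of GROUP BY you would run against MySQL,
             // now executed as a batch job over files mapped as a table.
             ResultSet rs = stmt.executeQuery(
                 "SELECT metadata_z, COUNT(*) FROM records "
               + "WHERE metadata_x = 'Y' GROUP BY metadata_z")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```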
I feel something like Impala would be more useful to you, since it allows you to query your big data with real-time-ness in mind.
The reason for everything I have mentioned above is that your use case requires more than just the scalability provided by Hadoop. Along with the ability to distribute the load, your solution should be able to cater to the needs you have specified. It is more than just distributing your data over a group of machines and running raw queries over it: your users will require real-time responses along with the smart-suggestions feature you mentioned in your question.
You actually need a smarter system than just a Hadoop cluster. Do have a look at Apache Mahout. It is an awesome tool that provides recommendation mining and can easily be used with Hadoop; you can find more on its home page. It will definitely help you add that smart-suggestions feature to your system.
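As a rough sketch of what Mahout's Taste recommender API looks like (the "user,item,preference" CSV file name is a placeholder, and "items" here could be metadata values a user has searched for before):

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

// Sketch of a user-based recommender with Mahout's Taste API.
// "preferences.csv" (user,item,preference rows) is a placeholder.
public class RecommendationExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("preferences.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 5 suggested items for user 1, based on similar users.
        List<RecommendedItem> items = recommender.recommend(1L, 5);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```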
You might also want to have a look at another member of the Hadoop family, HBase, which is a distributed, scalable big-data store. It acts like a database, but it is not a relational database. It also runs on an existing Hadoop cluster and provides real-time random read/write capabilities. Read a bit about it and see if it fits somewhere.
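A hedged sketch of such a real-time random read with the HBase Java client (table, column family, and row key are placeholders, and cluster configuration is assumed to come from hbase-site.xml on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of a point lookup with the newer ConnectionFactory-style API.
// Table name, column family, and row key are hypothetical.
public class HBaseRandomRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("records"))) {
            // Fetch one metadata column of a single record by row key.
            Get get = new Get(Bytes.toBytes("record-123"));
            Result result = table.get(get);
            byte[] value = result.getValue(
                Bytes.toBytes("meta"), Bytes.toBytes("metadata_x"));
            System.out.println(value == null
                ? "not found" : Bytes.toString(value));
        }
    }
}
```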
Last but not least, it all depends on your needs. An exact decision can be made only after giving different things a try and doing a comparative study. We can only make suggestions based on our experiences; a fair decision can be made only after testing a few tools and finding which one fits your requirements best :)
Upvotes: 1