Reputation: 1423
I have a script that matches 5 million records (A) against 40k records (B). If there is a match, an email is placed on a queue asynchronously.
A match occurs when certain attributes of A match those of B. Currently the script takes about a day to complete.
I want to reduce this to, say, 3 to 4 hours using Hadoop.
My plan is to store A in a file and use that file as the input to my Mapper. The Reducer can be eliminated.
What storage strategy for B would give minimal or no disk reads? In other words, where should I store it: memcache, HDFS, etc.? Memcache seems to be a good option since it eliminates disk access at runtime, but suggestions are welcome.
I am new to Hadoop, so what is the recommended approach in this scenario?
Upvotes: 1
Views: 197
Reputation: 6693
I'm not sure I can give a suitable answer....
Is your B file small enough to fit entirely in the memory of each mapper?
If so, Hadoop has a mechanism called the distributed cache, which distributes a file to all the nodes in the cluster. In your case, you can make B a cache file, load it into memory through configure(), and use it in your mapper.
DistributedCache.addCacheFile(new URI(/* B's path */), conf); // in run(); conf is the job's Configuration
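To make the idea more concrete, here is a rough sketch of the lookup logic that could sit inside the mapper once B has been loaded from the cache file. It is plain Java with no Hadoop classes so that it runs on its own. The record layout (comma-separated, with the second and third fields as the matching attributes and the fourth field of A as the email), the class name InMemoryJoinSketch, and the methods loadB, matchRecord and matchKey are all assumptions made up for illustration, not anything from your script or from the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryJoinSketch {

    // Assumed layout: B lines are "id,attr1,attr2", A lines are "id,attr1,attr2,email".
    // The match key is built from attr1 and attr2 (hypothetical choice of attributes).
    public static String matchKey(String[] fields) {
        return fields[1] + "|" + fields[2];
    }

    // Would run once per mapper (for example from configure()),
    // reading the local copy of B and indexing it by match key.
    public static Map<String, List<String>> loadB(List<String> bLines) {
        Map<String, List<String>> index = new HashMap<>();
        for (String line : bLines) {
            String[] fields = line.split(",", -1);
            index.computeIfAbsent(matchKey(fields), k -> new ArrayList<>()).add(fields[0]);
        }
        return index;
    }

    // Would run once per record of A (the body of map()).
    // Returns one message per matching B record; an empty list means no match.
    public static List<String> matchRecord(String aLine, Map<String, List<String>> index) {
        String[] fields = aLine.split(",", -1);
        List<String> result = new ArrayList<>();
        List<String> hits = index.get(matchKey(fields));
        if (hits != null) {
            for (String bId : hits) {
                result.add(fields[3] + " matched " + bId);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny made-up data set standing in for B and A.
        Map<String, List<String>> index =
            loadB(Arrays.asList("b1,red,large", "b2,blue,small"));
        for (String aLine : Arrays.asList("a1,red,large,x@example.com",
                                          "a2,green,large,y@example.com")) {
            for (String message : matchRecord(aLine, index)) {
                // In the real job this is where the email would be queued.
                System.out.println(message);
            }
        }
    }
}
```

With B indexed in a hash map, each record of A costs one in-memory lookup rather than a scan over all 40k records of B, and no disk or network access is needed during map(). If the real matching rule is not an exact equality on a few attributes, the index key would have to be chosen differently.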
Upvotes: 3