Rahul

Reputation: 1423

Suggest a Hadoop architecture

I have a script which matches 5 million records (A) against 40k records (B). If there is a match, an email is placed onto a queue asynchronously.

A match occurs if certain attributes of a record in A match those of a record in B. Currently this script takes about 1 day to complete.

I want to reduce this time to, let's say, 3 to 4 hours using Hadoop.

I think I will store A in a file, and this file will be the input to my Mapper. The Reducer can be eliminated.

What should the storage strategy for B be, for minimal or no disk reads? That is, where should I store it: memcache, HDFS, etc.? Memcache seems to be a good option since it eliminates disk access at runtime, but suggestions are welcome.

I am new to Hadoop, so what is the recommended approach in this scenario?

Upvotes: 1

Views: 197

Answers (1)

yjshen

Reputation: 6693

I'm not sure I can give a suitable answer...
Is your B file small enough to fit entirely in the memory of each mapper?
If so, Hadoop has a mechanism called the distributed cache, which makes it possible to distribute a file to all the nodes in the cluster. In your case, you can make B a cache file, load it into memory through configure(), and use it in your mapper.

DistributedCache.addCacheFile(new URI(/* B's path */), conf); // in run(); conf is the job's Configuration
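To make the in-memory part concrete, here is a minimal plain-Java sketch of what the mapper would do once B has been loaded. It deliberately leaves out the Hadoop classes so it runs on its own. The class name InMemoryMatch, the "key,value" line format, and the exact-equality match on a single key attribute are all assumptions for illustration; your real matching rule on "certain attributes" may need a different key or a composite one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InMemoryMatch {

    // Roughly what configure() would do: read B once and index it by the
    // attribute used for matching. The "key,value" layout is hypothetical.
    static Map<String, String> loadB(List<String> bLines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : bLines) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                lookup.put(parts[0].trim(), parts[1].trim());
            }
        }
        return lookup;
    }

    // Roughly what map() would do for each record of A: a single hash
    // lookup in memory. Returns the matching B value, or null if none.
    static String match(String aLine, Map<String, String> lookup) {
        String key = aLine.split(",", 2)[0].trim();
        return lookup.get(key);
    }

    public static void main(String[] args) {
        // Tiny stand-ins for B (40k records) and A (5 million records).
        Map<String, String> lookup = loadB(Arrays.asList(
                "k1,alice@example.com",
                "k2,bob@example.com"));

        for (String a : Arrays.asList("k2,foo", "k9,bar", "k1,baz")) {
            String hit = match(a, lookup);
            if (hit != null) {
                // In the real job, this is where the email would be queued.
                System.out.println(a + " matched " + hit);
            }
        }
    }
}
```

With B indexed like this, each record of A costs one hash lookup rather than a scan over B, and since every mapper holds its own copy there is no need for a reducer.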

Upvotes: 3
