Suggest hadoop architecture

Question

I have a script which matches 5 million records(A) with 40k records(B). If there is a match a email is placed on to the queue asynchronously.

The match occurs if certain attributes of A matches with B. Currently this script takes about 1 day to complete.

I want to reduce this time to lets say 3 to 4 hours using hadoop.

I think, I will store A in file and this file will be input to my Mapper. Reducer can be eliminated.

What should be the storage strategy for B for minimum or no disk reads? As in where should I store it. A memcache, hdfs etc . Memcache seems to be a good option since it eliminates disk access during runtime. But suggestions are welcomed.

I am new to hadoop. So what is recommended approach in this scenario.

Suggest hadoop architecture

Answers (1)

Related Questions