Anil Padia

Reputation: 563

Hadoop Map Reduce queries for large key spaces

I need to process one billion records periodically. The number of unique keys can be in the range of 10 million. The value is a string of up to 200K characters.

Here are my questions:

  1. The key space is very large (10 million unique keys). Would Hadoop be able to handle such a large key space? There will be one reducer per key, so there will be millions of reducers.

  2. I want to update the DB in the reducer itself. In the reducer, I will merge the values (call this the current value), read the existing value from the DB (call this the existing value), merge the current and existing values, and update the DB. Is this the right strategy?

  3. How many reducers can run simultaneously per box? Is it configurable? If only a single reducer runs per box at a time, it will be a problem, because I won't be able to update the state for keys in the DB fast enough.

  4. I want the job to complete in 2-3 hours. How many boxes would I need? (I can spare at most 50 boxes: 64 GB RAM, 8-core machines.)

Thanks

Upvotes: 0

Views: 1715

Answers (1)

Kartikeya Sinha

Reputation: 508

Answers to your questions:

a. You have the wrong idea of how key/value pairs are distributed among reducers. The number of reducers is not equal to the number of unique mapper output keys. The rule is: all the values associated with a given key from the mappers go to a single reducer. That in no way means a reducer will receive only one key.

For example, consider the following mapper outputs:

Mapper(k1,v1), Mapper(k1,v2), Mapper(k1,v3)
Mapper(k2,w1), Mapper(k2,w2)
Mapper(k3,u1), Mapper(k3,u2), Mapper(k3,u3), Mapper(k3,u4)

So the values related to k1 (v1, v2, and v3) will all go to a single reducer, say R1, and they will not be split across multiple reducers. But that doesn't mean R1 has only the one key k1 to process; it may receive the values of k2 or k3 as well. For any key a reducer receives, however, all the values associated with that key arrive at that same reducer. Hope that clears your doubt.
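If it helps to see it concretely, here is a small standalone sketch (the class name PartitionDemo and the reducer count are made up for illustration) of how the default HashPartitioner of the old mapred API assigns keys to reduce tasks; with far fewer reducers than unique keys, many keys end up on the same reducer:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class PartitionDemo {
    public static void main(String[] args) {
        // The default partitioner hashes each key into one of the reduce tasks,
        // so several keys normally land on the same reducer.
        HashPartitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
        int numReducers = 10; // illustrative reducer count
        for (String k : new String[] { "k1", "k2", "k3" }) {
            int p = partitioner.getPartition(new Text(k), new Text("v"), numReducers);
            System.out.println(k + " -> reducer " + p);
        }
    }
}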

b. Which DB are you using? To reduce the number of DB calls or update statements, you can issue your query at the end of reduce(), after the loop over the values for a particular key is complete.

For example:

// Old (mapred) API; needs org.apache.hadoop.io.Text and org.apache.hadoop.mapred.*
public static class ReduceJob extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output,
                Reporter reporter) throws IOException {

            // loop through all the values for this key and merge them
            while (values.hasNext()) {
                Text value = values.next();
                // ... merge value into your running "current value" here ...
            }
            // issue your DB read/merge/update here, once per key,
            // to keep the number of DB calls down
      }
}

c. Yes, the number of reducers is configurable. If you want to set it on a per-job basis, you can add a line to your job's run() method that sets the number of reduce tasks:

jobConf.setNumReduceTasks(numReducers); // equivalent to setting "mapred.reduce.tasks"

If you want to set it on a per-machine basis, i.e. how many reduce slots each machine in your cluster should have, then you need to change the Hadoop configuration of your cluster:

mapred.tasktracker.{map|reduce}.tasks.maximum - the maximum number of map/reduce tasks that run simultaneously on a given TaskTracker, set separately for maps and reduces. Defaults to 2 (2 maps and 2 reduces); vary it depending on your hardware.
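For example, the corresponding entries in mapred-site.xml on each TaskTracker node might look like this (the value 6 is purely illustrative, not a recommendation; restart the TaskTrackers after changing it):

<!-- mapred-site.xml on every TaskTracker node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>6</value>
</property>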

More details here: http://hadoop.apache.org/docs/stable/cluster_setup.html#Configuring+the+Hadoop+Daemons

d. If your data files are not gzipped (gzip files are not splittable, so each gzipped file goes to a single mapper), then as per what you said you have, in the worst case, 200 * 1024 bytes * 1 billion records = 204,800 GB, or roughly 204.8 TB of data. So if you want the job completed in 2-3 hours, you had better use all 50 boxes, and if the memory footprint of the reducer is low, increase the number of reducers per machine as described in the previous answer. Also, increasing the InputSplit size to around 128 MB might help.
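As a rough sketch of the job-side settings mentioned above (MyJob and the numbers are placeholders, not tuned recommendations), in the driver's run() method you could do something like:

// Illustrative job setup in the driver's run() method (old mapred API)
JobConf jobConf = new JobConf(MyJob.class);
jobConf.setLong("mapred.min.split.size", 128L * 1024L * 1024L); // ask for ~128 MB input splits
jobConf.setNumReduceTasks(400);                                 // spread the reduce work across the cluster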

Thanks and Regards.
Kartikeya Sinha

Upvotes: 3
