hello-klol

Reputation: 745

Hadoop conf to determine num map tasks

I have a job that, like all my Hadoop jobs, appears to have a total of 2 map tasks when running, from what I can see in the Hadoop interface. However, this means each task is loading so much data that I get a Java heap space error.

I've tried setting many different conf properties in my Hadoop cluster to make the job split into more tasks but nothing seems to have any effect.

I have tried setting mapreduce.input.fileinputformat.split.maxsize, mapred.max.split.size, and dfs.block.size, but none of them seem to have any effect.

I'm using Hadoop 0.20.2-cdh3u6 and trying to run a job that uses cascading.jdbc; the job is failing while reading data from the database. I think the issue could be resolved by increasing the number of splits, but I can't work out how to do that!

Please help! Going crazy!

2013-07-23 09:12:15,747 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
        at com.mysql.jdbc.Buffer.<init>(Buffer.java:59)
        at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1477)
        at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2936)
        at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:477)
        at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2631)
        at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1800)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2221)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2618)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2568)
        at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1557)
        at cascading.jdbc.db.DBInputFormat$DBRecordReader.<init>(DBInputFormat.java:97)
        at cascading.jdbc.db.DBInputFormat.getRecordReader(DBInputFormat.java:376)
        at cascading.tap.hadoop.MultiInputFormat$1.operate(MultiInputFormat.java:282)
        at cascading.tap.hadoop.MultiInputFormat$1.operate(MultiInputFormat.java:277)
        at cascading.util.Util.retry(Util.java:624)
        at cascading.tap.hadoop.MultiInputFormat.getRecordReader(MultiInputFormat.java:276)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
        at org.apache.hadoop.mapred.Child.main(Child.java:260)

Upvotes: 1

Views: 655

Answers (2)

hello-klol

Reputation: 745

My job was reading in data from a table where 1,000 rows equate to about 1MB. This particular job was trying to read in 753,216 URLs. It turns out the Java heap space of each task process is capped at 200MB. As Brugere pointed out in the comments on my question, I can set the mapred.child.java.opts property in mapred-site.xml, which controls the heap space (http://developer.yahoo.com/hadoop/tutorial/module7.html).

I found I also had to set <final>true</final> for this property in my config file, otherwise the value was reset to 200MB (is it possible it's reset somewhere in the code? Perhaps in cascading.jdbc?).
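For reference, the relevant block in my mapred-site.xml ended up looking something like this (the -Xmx value here is only illustrative, not necessarily what you should use):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>   <!-- example heap size; pick what your tasks actually need -->
      <final>true</final>       <!-- without this the value was reset to 200MB -->
    </property>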

I'll look at setting this heap space property in my code when setting up a job, whenever I detect that it will require a larger amount of heap space, and leave the general Hadoop config at the default 200MB.
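Something along these lines is what I have in mind (the class name, row-count threshold, and heap value are placeholders for illustration; the returned Properties would be passed to the Cascading FlowConnector as usual):

    import java.util.Properties;

    public class JobHeapSettings {
        /** Returns per-job properties, bumping the child heap only when the
         *  job is expected to read a large number of rows. */
        public static Properties forExpectedRows(long expectedRows) {
            Properties properties = new Properties();
            if (expectedRows > 500000) {   // arbitrary cutoff for this sketch
                // same property as in mapred-site.xml, but scoped to this job only
                properties.put("mapred.child.java.opts", "-Xmx512m");
            }
            return properties;
        }
    }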

Upvotes: 0

Brugere

Reputation: 436

You should look at memory-management settings like io.sort.mb or mapred.cluster.map.memory.mb, because heap space errors are generally due to an allocation problem rather than to the number of maps.

If you want to force your map number, you have to consider that some values take precedence over others. For instance, a small mapreduce.input.fileinputformat.split.maxsize will generate a huge number of tasks even if you set mapred.tasktracker.map.tasks.maximum to a small value.

dfs.block.size has an impact on the number of generated maps only if it is greater than mapreduce.input.fileinputformat.split.maxsize. An example of how these might sit together in the config is sketched below.
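A rough sketch, with values that are only illustrative:

    <property>
      <name>mapreduce.input.fileinputformat.split.maxsize</name>
      <value>67108864</value>    <!-- 64MB upper bound per split -->
    </property>
    <property>
      <name>dfs.block.size</name>
      <value>134217728</value>   <!-- 128MB HDFS block size -->
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>           <!-- concurrent map slots per TaskTracker, not the total number of tasks -->
    </property>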

Upvotes: 1
