Reputation: 745
I have a job that, like all my Hadoop jobs, appears to run with a total of only 2 map tasks, from what I can see in the Hadoop interface. However, this means each task is loading so much data that I get a Java heap space error.
I've tried setting many different conf properties in my Hadoop cluster to make the job split into more tasks. In particular I have tried mapreduce.input.fileinputformat.split.maxsize, mapred.max.split.size and dfs.block.size, but none of them seem to have any effect.
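For what it's worth, this is roughly how I have been applying those properties before submitting the job (just a sketch; the 16MB figure is only an example value):

import org.apache.hadoop.mapred.JobConf;

public class SplitSettings {
    // Sketch: set the split-related properties on the job configuration
    // before submitting. The sizes here are example values only.
    public static void applySplitSettings(JobConf conf) {
        conf.set("mapreduce.input.fileinputformat.split.maxsize", "16777216"); // 16 MB
        conf.set("mapred.max.split.size", "16777216");
        conf.set("dfs.block.size", "16777216");
    }
}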
I'm using Hadoop 0.20.2-cdh3u6 and trying to run a job using cascading.jdbc; the job is failing while reading data from the database. I think this issue can be resolved by increasing the number of splits, but I can't work out how to do that!
Please help! Going crazy!
2013-07-23 09:12:15,747 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.Buffer.<init>(Buffer.java:59)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1477)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2936)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:477)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2631)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1800)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2221)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2618)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2568)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1557)
at cascading.jdbc.db.DBInputFormat$DBRecordReader.<init>(DBInputFormat.java:97)
at cascading.jdbc.db.DBInputFormat.getRecordReader(DBInputFormat.java:376)
at cascading.tap.hadoop.MultiInputFormat$1.operate(MultiInputFormat.java:282)
at cascading.tap.hadoop.MultiInputFormat$1.operate(MultiInputFormat.java:277)
at cascading.util.Util.retry(Util.java:624)
at cascading.tap.hadoop.MultiInputFormat.getRecordReader(MultiInputFormat.java:276)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Upvotes: 1
Views: 655
Reputation: 745
My job was reading data from a table where 1,000 rows equate to about 1MB, and this particular job was trying to read in 753,216 URLs. It turns out the Java heap space of each task process is capped at 200MB. As Brugere pointed out in the comments on my question, I can set the mapred.child.java.opts property in mapred-site.xml, which controls the heap space (http://developer.yahoo.com/hadoop/tutorial/module7.html).
I found I also had to set <final>true</final> for this property in my config file, or else the value was reset to 200MB (is it possible it's reset somewhere in the code? Perhaps in cascading.jdbc?).
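For reference, the entry in my mapred-site.xml ended up looking roughly like this (the -Xmx value is just an example; only the property name and the final flag come from what's described above):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
  <final>true</final>
</property>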
I'll be looking at setting this heap space property in my code when setting up a job, in cases where I detect that it will require a larger amount of heap space, leaving the general Hadoop config at the default 200MB; something along the lines of the sketch below.
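A minimal sketch of that idea, assuming the Cascading 1.x cascading.flow.FlowConnector(Map) constructor that matches the stack trace above (the expectLargeResultSet flag and the -Xmx value are hypothetical placeholders for whatever detection logic I end up writing):

import java.util.Properties;

import cascading.flow.FlowConnector;

public class JobSetup {
    public static FlowConnector createConnector(boolean expectLargeResultSet) {
        Properties properties = new Properties();

        // Only bump the child heap for jobs known to pull a large result set;
        // everything else keeps the cluster-wide 200MB default.
        if (expectLargeResultSet)
            properties.setProperty("mapred.child.java.opts", "-Xmx1024m");

        return new FlowConnector(properties);
    }
}

The same key/value pair could equally be set directly on a JobConf if the job were wired up with the plain Hadoop API instead of Cascading.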
Upvotes: 0
Reputation: 436
You should look at memory-management settings such as io.sort.mb or mapred.cluster.map.memory.mb, because heap space errors are generally due to an allocation problem and not to the number of maps.
If you want to force your map number, you have to consider that some values take precedence over others. For instance, a small mapreduce.input.fileinputformat.split.maxsize will generate a huge number of tasks even if you set mapred.tasktracker.map.tasks.maximum to a small value.
dfs.block.size has an impact on the number of generated maps only if it is greater than mapreduce.input.fileinputformat.split.maxsize.
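To make that precedence concrete, FileInputFormat-style inputs pick their split size roughly like this (a simplified sketch of the standard formula, not the exact CDH3 source):

public class SplitSizeSketch {
    // The DFS block size only influences the split size while it falls
    // between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}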
Upvotes: 1