Reputation: 745
I have a job that, like all my Hadoop jobs, appears to run with a total of only 2 map tasks, from what I can see in the Hadoop interface. However, this means each task is loading so much data that I get a Java heap space error.
I've tried setting many different conf properties in my Hadoop cluster to make the job split into more tasks. In particular I have tried mapreduce.input.fileinputformat.split.maxsize, mapred.max.split.size and dfs.block.size, but none of them seem to have any effect.
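For what it's worth, this is roughly how I have been applying those properties before submitting the job (just a sketch; the 16MB figure is only an example value):

import org.apache.hadoop.mapred.JobConf;

public class SplitSettings {
    // Sketch: set the split-related properties on the job configuration
    // before submitting. The sizes here are example values only.
    public static void applySplitSettings(JobConf conf) {
        conf.set("mapreduce.input.fileinputformat.split.maxsize", "16777216"); // 16 MB
        conf.set("mapred.max.split.size", "16777216");
        conf.set("dfs.block.size", "16777216");
    }
}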
I'm using Hadoop 0.20.2-cdh3u6 and trying to run a job using cascading.jdbc; the job is failing while reading data from the database. I think this issue can be resolved by increasing the number of splits, but I can't work out how to do that!
Please help! Going crazy!
2013-07-23 09:12:15,747 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.Buffer.<init>(Buffer.java:59)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1477)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:2936)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:477)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:2631)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:1800)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2221)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2618)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2568)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1557)
at cascading.jdbc.db.DBInputFormat$DBRecordReader.<init>(DBInputFormat.java:97)
at cascading.jdbc.db.DBInputFormat.getRecordReader(DBInputFormat.java:376)
at cascading.tap.hadoop.MultiInputFormat$1.operate(MultiInputFormat.java:282)
at cascading.tap.hadoop.MultiInputFormat$1.operate(MultiInputFormat.java:277)
at cascading.util.Util.retry(Util.java:624)
at cascading.tap.hadoop.MultiInputFormat.getRecordReader(MultiInputFormat.java:276)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:370)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.Child.main(Child.java:260)
Upvotes: 1
Views: 655
Reputation: 745
My job was reading data from a table where 1,000 rows equate to about 1MB, and this particular job was trying to read in 753,216 URLs. It turns out the Java heap space of each task process is capped at 200MB. As Brugere pointed out in the comments on my question, I can set the mapred.child.java.opts property in mapred-site.xml, which controls the heap space (http://developer.yahoo.com/hadoop/tutorial/module7.html).
I found I also had to set <final>true</final> for this property in my config file, or else the value was reset to 200MB (is it possible it's reset somewhere in the code? Perhaps in cascading.jdbc?).
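For reference, the entry in my mapred-site.xml ended up looking roughly like this (the -Xmx value is just an example; only the property name and the final flag come from what's described above):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
  <final>true</final>
</property>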
I'll be looking at setting this heap space property in my code when setting up a job, in cases where I detect that it will require a larger amount of heap space, leaving the general Hadoop config at the default 200MB; something along the lines of the sketch below.
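A minimal sketch of that idea, assuming the Cascading 1.x cascading.flow.FlowConnector(Map) constructor that matches the stack trace above (the expectLargeResultSet flag and the -Xmx value are hypothetical placeholders for whatever detection logic I end up writing):

import java.util.Properties;

import cascading.flow.FlowConnector;

public class JobSetup {
    public static FlowConnector createConnector(boolean expectLargeResultSet) {
        Properties properties = new Properties();

        // Only bump the child heap for jobs known to pull a large result set;
        // everything else keeps the cluster-wide 200MB default.
        if (expectLargeResultSet)
            properties.setProperty("mapred.child.java.opts", "-Xmx1024m");

        return new FlowConnector(properties);
    }
}

The same key/value pair could equally be set directly on a JobConf if the job were wired up with the plain Hadoop API instead of Cascading.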
Upvotes: 0
Reputation: 436
You should look at memory-management settings such as io.sort.mb or mapred.cluster.map.memory.mb, because heap space errors are generally due to an allocation problem and not to the number of maps.
If you want to force your map number, you have to consider that some values take precedence over others. For instance, a small mapreduce.input.fileinputformat.split.maxsize will generate a huge number of tasks even if you set mapred.tasktracker.map.tasks.maximum to a small value.
dfs.block.size has an impact on the number of generated maps only if it is greater than mapreduce.input.fileinputformat.split.maxsize.
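To make that precedence concrete, FileInputFormat-style inputs pick their split size roughly like this (a simplified sketch of the standard formula, not the exact CDH3 source):

public class SplitSizeSketch {
    // The DFS block size only influences the split size while it falls
    // between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }
}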
Upvotes: 1