user3708149

Reputation: 21

How does Hadoop handle a very large individual split file?

Suppose each mapper only has 1 GB of heap, but the block size is set to 10 GB, so each input split is 10 GB as well. How does the mapper read such a large individual split?

Will the mapper buffer the input to disk and process the input split in a round-robin fashion?

Thanks!

Upvotes: 1

Views: 201

Answers (1)

Clément MATHIEU

Reputation: 3171

The overall pattern of a mapper is quite simple:

while not end of split
  (key, value) = RecordReader.next()
  (keyOut, valueOut) = map(key, value)
  RecordWriter.write(keyOut, valueOut)
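
In the new MapReduce API this loop is Mapper.run(). A simplified rendering, modeled on Hadoop's source (progress reporting omitted):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Simplified view of org.apache.hadoop.mapreduce.Mapper.run(): the framework
// pulls one (key, value) pair at a time from the RecordReader, so only the
// current record has to fit in memory, never the whole split.
public class RunLoop<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {    // RecordReader advances one record
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}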

Usually the first two operations only care about the size of the current record, not the size of the split. For example, when TextInputFormat is asked for the next line, it stores bytes in a buffer only until the next end of line is found; then the buffer is cleared and reused for the next record, and so on.
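
A standalone sketch of that buffering idea (my own toy code, not Hadoop's actual LineRecordReader, just the principle):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Bytes accumulate only until the delimiter, then the buffer is cleared,
// so memory use is proportional to one record, not to the whole split.
public class LineBufferSketch {
  public static void main(String[] args) throws IOException {
    InputStream in = new ByteArrayInputStream(
        "first record\nsecond record\n".getBytes(StandardCharsets.UTF_8));
    StringBuilder buffer = new StringBuilder();
    int b;
    while ((b = in.read()) != -1) {
      if (b == '\n') {
        System.out.println("record: " + buffer);
        buffer.setLength(0);   // clear: the record has been handed out
      } else {
        buffer.append((char) b);
      }
    }
  }
}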

The map implementation is up to you. If you don't store anything in your mapper, you are fine. If you want it to be stateful, you can get into trouble: make sure your memory consumption is bounded.
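
For example, a mapper that aggregates across records can keep its heap usage bounded by flushing its state once it passes a threshold. A sketch of that pattern (the class name and threshold are mine; the Mapper API is Hadoop's):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stateful mapper whose memory stays bounded: it aggregates counts in a map
// but flushes to the framework once the map grows past a fixed threshold,
// so a 10 GB split cannot blow a 1 GB heap.
public class BoundedCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int FLUSH_THRESHOLD = 100_000;
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
      counts.merge(word, 1, Integer::sum);
    }
    if (counts.size() > FLUSH_THRESHOLD) {
      flush(context);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);   // emit whatever is left at the end of the split
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
    counts.clear();
  }
}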

In the last step, the keys and values written by your mapper are stored in an in-memory buffer, where they are partitioned and sorted. If the buffer becomes full, its content is spilled to disk (it will eventually be spilled anyway, because reducers need to be able to download the partition file even after the mapper has vanished).
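
The size of that buffer and the spill trigger are tunable. A sketch using the MRv2 property names mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent (older releases used io.sort.mb and io.sort.spill.percent; the values below are only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Tuning the map-side sort buffer: with these (illustrative) values the
// mapper buffers up to 256 MB of output and starts spilling to disk once
// the buffer is 80% full.
public class SpillTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 256);
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
    Job job = Job.getInstance(conf, "spill-tuning-example");
    // ... set mapper class, input/output paths, etc., then job.waitForCompletion(true)
  }
}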

So the answer to your question is: yes, it will be fine.

What could cause trouble is:

  • Large records (exponential buffer growth + memory copies => significant to insane memory overhead)
  • Storing data from the previous key/value in your mapper
  • Storing data from the previous key/value in your custom (Input|Output)Format implementation if you have one

If you want to learn more, here are a few entry points:

  • In Mapper.java you can see the while loop
  • In LineRecordReader you can see how a line is read by a TextInputFormat
  • You most likely want to understand the spill mechanism, because it impacts the performance of your jobs. See these Cloudera slides for example. Then you will be able to decide on the best settings for your use case (large vs. small splits).

Upvotes: 2
