bhargav

Reputation: 164

Hadoop block size issues

I've been tasked with processing multiple terabytes of SCM data for my company. I set up a Hadoop cluster and have a script to pull data from our SCM servers.

Since I'm processing data in batches through the streaming interface, I came across an issue with block sizes that O'Reilly's Hadoop book doesn't seem to address: what happens to a record that straddles two blocks? How does the wordcount example get around this? So far, our workaround has been to keep each input file smaller than 64 MB.

The issue came up again when I was thinking about the reducer script: how is the aggregated data from the maps stored, and could the same issue come up when reducing?

Upvotes: 0

Views: 1219

Answers (3)

rICh

Reputation: 1709

Your question about "data straddling two blocks" is what the RecordReader handles. The purpose of a RecordReader is threefold:

  1. Ensure each k,v pair is processed
  2. Ensure each k,v pair is only processed once
  3. Handle k,v pairs which are split across blocks

What actually happens in (3) is that the RecordReader goes back to the NameNode, gets the handle of a DataNode where the next block lives, and then reaches out via RPC to read just far enough into that block to finish the partial record, i.e. up to the record delimiter.
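As a rough illustration of that logic, here is a minimal Python sketch of a line-oriented record reader (this is not Hadoop's actual LineRecordReader, which is written in Java; the file name and split size below are made up):

    # Sketch of how a line-oriented record reader handles split boundaries.
    def read_split(path, start, end):
        """Yield this split's share of the file's records, one per line."""
        with open(path, "rb") as f:
            f.seek(start)
            # Each record is processed exactly once: every reader except the
            # first skips the (possibly partial) line at the start of its
            # split, because the previous reader will have claimed it.
            if start != 0:
                f.readline()
            pos = f.tell()
            # <= end so the record beginning exactly at the boundary is
            # claimed by this split; readline() happily crosses into the
            # next block to finish a record that straddles the boundary.
            while pos <= end:
                line = f.readline()
                if not line:
                    break
                yield line.rstrip(b"\n")
                pos = f.tell()

    # Two adjacent splits over the same file cover every line exactly once:
    # for record in read_split("scm_dump.txt", 0, 64 * 1024 * 1024):
    #     process(record)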

Upvotes: 0

Joe Stein

Reputation: 1275

This should not be an issue provided that each split can break the data apart cleanly (for example, at a line break). If your data is not a line-by-line data set then yes, this could be a problem. You can also increase the block size on your cluster (dfs.block.size).

You can also customize how your streaming job splits input lines into key/value pairs on the way into your mapper:

http://hadoop.apache.org/common/docs/current/streaming.html#Customizing+the+Way+to+Split+Lines+into+Key%2FValue+Pairs
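For example, a streaming mapper just reads whole lines from stdin and writes tab-separated key/value pairs to stdout; the framework decides where the splits begin and end. A minimal Python sketch (the field layout here is hypothetical; adapt it to your SCM records):

    #!/usr/bin/env python
    # Minimal streaming mapper: read lines from stdin, emit key<TAB>value.
    # By default, streaming treats everything up to the first tab as the key.
    import sys

    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        # Hypothetical layout: first whitespace-separated field is the key,
        # the rest of the line is the value.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        print("%s\t%s" % (key, value))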

Data from the map step gets partitioned and sorted by the map output key, based on a partitioner class:

http://hadoop.apache.org/common/docs/r0.15.2/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29
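Conceptually, a partitioner just maps each map-output key to a reducer; KeyFieldBasedPartitioner lets you hash only the first few fields of the key, so related keys land on the same reducer. A rough Python sketch of that idea (the real thing is a Java class selected with the -partitioner option):

    import zlib

    # Conceptual sketch of key-field-based partitioning: hash only a prefix
    # of the key so that all keys sharing that prefix go to the same reducer.
    def partition(key, num_reducers, num_key_fields=1, separator="\t"):
        prefix = separator.join(key.split(separator)[:num_key_fields])
        return zlib.crc32(prefix.encode("utf-8")) % num_reducers

    # "projectA\t2011-03-01" and "projectA\t2011-04-15" go to the same
    # reducer when num_key_fields=1, which is what makes a secondary sort
    # on the remaining fields possible.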

The data is then shuffled so that all the values for a given map key come together and get transferred to the reducer. If you like, a combiner can also run before the reduce step to pre-aggregate the map output.
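In streaming, the reducer sees its input already sorted by key, one key<TAB>value pair per line, so aggregation is just a matter of noticing when the key changes. A sketch of a summing reducer (a script like this can often double as the combiner):

    #!/usr/bin/env python
    # Streaming reducer: input arrives sorted by key, one "key<TAB>value"
    # line at a time, so we emit a total each time the key changes.
    import sys

    current_key = None
    total = 0

    for line in sys.stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                print("%s\t%d" % (current_key, total))
            current_key = key
            total = 0
        total += int(value)   # assumes numeric values, wordcount-style

    if current_key is not None:
        print("%s\t%d" % (current_key, total))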

Most likely you can also create your own custom -inputreader (here is an example of how to stream XML documents: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html).
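The idea behind such a record reader is to treat everything between a begin tag and its matching end tag as one record, no matter how the bytes fall across blocks. A rough Python sketch of that grouping logic (StreamXmlRecordReader itself is Java and is configured via -inputreader; the <record> tag name below is just an example):

    # Group a line-oriented stream into whole <record>...</record> blocks.
    def xml_records(stream, begin="<record>", end="</record>"):
        buf = []
        inside = False
        for line in stream:
            if begin in line:
                inside = True
                buf = []
            if inside:
                buf.append(line)
                if end in line:
                    inside = False
                    yield "".join(buf)

    # import sys
    # for record in xml_records(sys.stdin):
    #     handle(record)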

Upvotes: 1

wlk

Reputation: 5785

If you have multiple terabytes of input, you should consider setting the block size to even more than 128 MB.

If a file is bigger than one block, it can either be split, so that each block of the file goes to a different mapper, or the whole file can go to one mapper (for example, if the file is gzipped and therefore not splittable). I guess you can also control this with some configuration options.

Splits are taken care of automatically and you should not have to worry about them. Output from the maps is stored in a temporary directory on the local disk of the node that ran the map, and is served to the reducers during the shuffle.

Upvotes: 0
