Matt E

Reputation: 477

Processing Large Binary Files with Hadoop

I know there have been similar posts on here but I can't find one that really has a solid answer.

We have a Hadoop cluster loaded with binary files. These files can range in size from a few hundred KB to hundreds of MB.

We are currently processing these files using a custom record reader that reads the entire contents of a file into each map. From there we extract the metadata we want and serialize it into JSON.
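For reference, our reader is essentially the standard whole-file pattern; the sketch below is illustrative (class name and key/value types are placeholders, not our exact code):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Illustrative whole-file reader: emits one record per file, with the
    // full file contents as the value.
    public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Load the whole file into memory, then hand it to the mapper.
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }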

The problem we foresee is that we might eventually reach a data size that our NameNode can't handle. There is only so much memory to go around, and having a NameNode with a couple of terabytes of memory seems ridiculous.

Is there a graceful way to process large binary files like this? Especially files that can't be split, because we don't know what order the reducer would put the pieces back together in?

Upvotes: 2

Views: 7494

Answers (3)

Chris White

Reputation: 30089

Not an answer as such, but I have so many questions that a list of comments would be harder to follow, so here goes:

You say you read the entire contents into memory for each map; can you elaborate on the actual binary input format of these files:

  • Do they contain logical records, i.e. does a single input file represent a single record, or does it contain many records?
  • Are the files compressed (after-the-fact or some internal compression mechanism)?
  • How are you currently processing this file-at-once, and what's your overall ETL logic to convert to JSON?
  • Do you actually need the entire file in memory before processing can begin, or can you start processing once a buffer of some size is populated (DOM vs. SAX XML parsing, for example)?

My guess is that you can migrate some of your mapper logic to the record reader, and possibly even find a way to 'split' the file between multiple mappers. This would then allow you to address your scalability concerns.
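To make that concrete, here is a rough, hypothetical sketch of a record reader that streams its split in fixed-size chunks instead of loading the whole file. It only makes sense if your binary format can actually be parsed incrementally from chunk boundaries; the class name and chunk size are made up for illustration:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Hypothetical chunked reader: emits the bytes of its split as a series
    // of fixed-size records rather than one file-sized record.
    public class ChunkedRecordReader extends RecordReader<LongWritable, BytesWritable> {

        private static final int CHUNK_SIZE = 4 * 1024 * 1024; // 4 MB, illustrative

        private FSDataInputStream in;
        private long start, end, pos;
        private final LongWritable key = new LongWritable();
        private final BytesWritable value = new BytesWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration conf = context.getConfiguration();
            Path file = split.getPath();
            start = split.getStart();
            end = start + split.getLength();
            pos = start;
            in = file.getFileSystem(conf).open(file);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (pos >= end) {
                return false;
            }
            int toRead = (int) Math.min(CHUNK_SIZE, end - pos);
            byte[] buf = new byte[toRead];
            in.readFully(pos, buf, 0, toRead); // positional read within the split
            key.set(pos);                      // key = byte offset of the chunk
            value.set(buf, 0, toRead);
            pos += toRead;
            return true;
        }

        @Override
        public LongWritable getCurrentKey() { return key; }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() {
            return end == start ? 1.0f : (pos - start) / (float) (end - start);
        }

        @Override
        public void close() throws IOException {
            if (in != null) {
                in.close();
            }
        }
    }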

To address some points in your question:

  • The NameNode only requires memory to store information about the blocks (names, blocks [size, length, locations]). Assuming you assign it a decent memory footprint (GBs), there is no reason you can't have a cluster that holds petabytes of data in HDFS storage, assuming you have enough physical storage (see the rough estimate below).
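As a rough illustration only - the ~150 bytes per namespace object is a commonly quoted rule of thumb, not an exact figure, and the data size and block size below are made up for the example:

    // Back-of-envelope NameNode heap estimate. All figures are approximate:
    // the rule of thumb is roughly 150 bytes of heap per namespace object
    // (file, directory, block).
    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            long totalData   = 1024L * 1024 * 1024 * 1024 * 1024; // 1 PB of data
            long blockSize   = 64L * 1024 * 1024;                 // 64 MB HDFS block size
            long bytesPerObj = 150;                               // rough heap cost per block object

            long numBlocks = totalData / blockSize;               // ~16.8 million blocks
            long heapBytes = numBlocks * bytesPerObj;             // ~2.4 GB (file/dir objects excluded)

            System.out.printf("blocks: %,d  approx heap: %,d MB%n",
                    numBlocks, heapBytes / (1024 * 1024));
        }
    }

In other words, even a petabyte of 64 MB blocks only costs the NameNode a few GB of heap for block metadata, so the storage layer scales much further than the "terabytes of NameNode memory" worry suggests.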

Upvotes: 1

When you have large binary files, use the SequenceFile format as the input format and set the mapred input split size accordingly. You can work out the number of mappers from the total input size and the split size you set. Hadoop will take care of splitting the input data.

If you have binary files compressed in some format, then Hadoop cannot do this splitting, so the binary format has to be SequenceFile.
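A minimal job configuration along these lines might look like the sketch below (assuming a Hadoop 2.x-style job setup with the org.apache.hadoop.mapreduce API; the split size, class name, and the mapper/reducer wiring are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BinaryToJsonJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "binary-to-json");
            job.setJarByClass(BinaryToJsonJob.class);

            // Read the packed binary data as a SequenceFile.
            job.setInputFormatClass(SequenceFileInputFormat.class);

            // Cap the split size so the input is broken into multiple splits,
            // one mapper per split. 64 MB here is purely illustrative.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Mapper/reducer classes and output key/value types would be set here.

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }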

Upvotes: 0

Tariq

Reputation: 34184

The NameNode doesn't have anything to do with either storage or processing. You should concentrate on your DataNodes and TaskTrackers instead. Also, I'm not clear whether you are trying to address the storage or the processing of your files here. If you are dealing with lots of binary files, it is worth having a look at Hadoop SequenceFile. A SequenceFile is a flat file consisting of binary key/value pairs, hence it is used extensively in MapReduce as an input/output format. For a detailed explanation you can visit this page:

http://wiki.apache.org/hadoop/SequenceFile
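If it helps, here is a minimal, illustrative sketch of packing a directory of binary files into one SequenceFile, with the file name as key and the raw bytes as value (the class name and paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Illustrative packer: copies every file in an input directory into a
    // single SequenceFile, keyed by the original file name.
    public class SequenceFilePacker {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]);
            Path output = new Path(args[1]);

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, output, Text.class, BytesWritable.class);
            try {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDir()) {
                        continue;
                    }
                    // Files of up to a few hundred MB fit in a per-file buffer.
                    byte[] buf = new byte[(int) status.getLen()];
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        IOUtils.readFully(in, buf, 0, buf.length);
                    } finally {
                        IOUtils.closeStream(in);
                    }
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(buf));
                }
            } finally {
                writer.close();
            }
        }
    }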

Upvotes: 0
