Tara

Reputation: 549

Writing a sequence file with MapReduce vs. org.apache.hadoop.fs: what are the differences?

I have seen examples of writing a sequence file to HDFS using either the org.apache.hadoop.fs package or MapReduce. My questions are:

  1. What are the differences?
  2. Is the end result the same? That is, does the sequence file written to HDFS come out the same with both methods?
  3. I have only tried org.apache.hadoop.fs to write a sequence file. When I used hadoop fs -text to view the result, I saw the "key" still attached to each record/block. Would it be the same if I used MapReduce to produce the sequence file? I would rather not see the "key".
  4. How does one decide which method to use to write a sequence file to HDFS?

Upvotes: 0

Views: 2680

Answers (2)

Jagesh Maharjan

Reputation: 913

In a sequence file you write your content as objects, including your own custom objects, whereas a text file is just one string per line.

Upvotes: 1

nochum

Reputation: 795

The Apache Hadoop Wiki states that "SequenceFile is a flat file consisting of binary key/value pairs". The Wiki shows the actual file format, which includes the key. Note that SequenceFiles support multiple formats, such as "Uncompressed", "Record Compressed", and "Block Compressed". Additionally, there are various compression codecs that can be used. Since the file format and compression information are stored in the file header, applications (such as Mapper and Reducer tasks) can easily determine how to process the files correctly.

In the image below you can see that the append() method on the org.apache.hadoop.io.SequenceFile.Writer class requires both a key and a value:

[Image: append() method for the SequenceFile.Writer class]

Also keep in mind that both the MapReduce Mapper and Reducer ingest and emit key-value pairs. So having the key stored in the SequenceFile allows Hadoop to operate very efficiently with these types of files.
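To make the "every record carries a key" point concrete, here is a minimal sketch in plain Java that needs no Hadoop dependency. It is loosely modeled on the uncompressed record layout (record length, key length, key, value), but it is only a toy: the class KeyValueRecordSketch and its append, encode, and decode methods are hypothetical names invented for this illustration, and the real format also has a header, sync markers, and Writable serialization that are left out here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy illustration only. This is NOT the real SequenceFile format: it has no
// header, no sync markers, and no Writable serialization or compression.
class KeyValueRecordSketch {

    // Writes one record as: record length, key length, key bytes, value bytes.
    // Like append(key, value), it cannot write a value without a key.
    static void append(DataOutputStream out, String key, String value) throws IOException {
        byte[] k = key.getBytes(StandardCharsets.UTF_8);
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        out.writeInt(k.length + v.length); // record length (key + value)
        out.writeInt(k.length);            // key length
        out.write(k);
        out.write(v);
    }

    // Encodes an array of {key, value} pairs into one flat byte array.
    static byte[] encode(String[][] pairs) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (String[] pair : pairs) {
                append(out, pair[0], pair[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the records back; the key is always recovered alongside the value.
    static List<String[]> decode(byte[] data) {
        List<String[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int recordLength = in.readInt();
                int keyLength = in.readInt();
                byte[] k = new byte[keyLength];
                byte[] v = new byte[recordLength - keyLength];
                in.readFully(k);
                in.readFully(v);
                records.add(new String[] {
                    new String(k, StandardCharsets.UTF_8),
                    new String(v, StandardCharsets.UTF_8)
                });
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = encode(new String[][] { {"1", "alpha"}, {"2", "beta"} });
        for (String[] record : decode(data)) {
            System.out.println(record[0] + "\t" + record[1]);
        }
    }
}
```

Because the key is part of every record in the bytes themselves, any reader of the file gets it back regardless of which program wrote the records.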

So in a nutshell:

  1. SequenceFiles will always contain a "key" in addition to the "value".
  2. Two SequenceFiles containing the same data are not necessarily exactly the same in terms of size or actual bytes. It all depends on whether compression is used, the type of compression, and the compression codec.
  3. The method you use to create SequenceFiles and add them to HDFS largely depends on what you are trying to achieve. SequenceFiles are typically a means to efficiently accomplish a particular goal; they are rarely the end result.
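Point 2 above can be illustrated with a small sketch, again in plain Java without Hadoop. It assumes the JDK's built-in DEFLATE streams as a stand-in for a Hadoop compression codec; the class CompressionSizeSketch and its methods are hypothetical names made up for this example. The same logical data produces different bytes and a different size once compression is applied, yet decompressing gives the original content back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Stand-in illustration: java.util.zip plays the role of a compression codec.
class CompressionSizeSketch {

    // Builds some repetitive key/value text so that compression has an effect.
    static byte[] sampleData(int records) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append("key").append(i).append("\tsome repeated value\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses the input with DEFLATE.
    static byte[] deflate(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decompresses DEFLATE data back to the original bytes.
    static byte[] inflate(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = sampleData(1000);
        byte[] packed = deflate(plain);
        System.out.println("uncompressed bytes: " + plain.length);
        System.out.println("compressed bytes:   " + packed.length);
    }
}
```

So when comparing two files produced by different methods, comparing the decoded records is more meaningful than comparing file sizes or raw bytes.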

Upvotes: 0
