Reputation: 21

What is the difference between OutputStream and FSDataOutputStream when using Hadoop?

I am new to Hadoop, and while working through a book I saw a number of examples that use OutputStream and FSDataOutputStream interchangeably to interact with the HDFS file system. Can anyone briefly explain the difference between these two classes?

Upvotes: 2

Answers (1)

Chris Nauroth

Reputation: 9844

Apache Hadoop uses the FSDataOutputStream class to layer additional functionality over a JDK DataOutputStream, which is itself an OutputStream. Browsing the JavaDocs, we can see that the subclass defines a few additional methods:

getPos(): Returns the current position in the stream.
hflush(): An HDFS-specific addition that allows the caller to flush file data and make it visible to concurrent readers of the same file.
hsync(): An HDFS-specific addition that allows the caller to flush/sync file data to the underlying disk at the DataNode for durability.
setDropBehind(Boolean): Controls use of the fadvise syscall at the DataNode to evict block data from the buffer cache once it has been written (the "after reading" case applies to the matching method on the input stream).
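
As a rough illustration of the methods listed above, here is a minimal sketch that writes two records and calls hflush(), hsync() and getPos() along the way. The class name, method name and file path are made up for the example, and it assumes the Hadoop client libraries are on the classpath. With no cluster configuration it runs against the local file system, where hflush() and hsync() may give weaker guarantees than they do on HDFS. setDropBehind() is left out because not every file system supports it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDataOutputStreamDemo {

    // Writes two records to the given path and returns the stream position
    // reported after the second write. (Hypothetical helper for illustration.)
    static long writeRecords(FileSystem fs, Path path) throws IOException {
        try (FSDataOutputStream out = fs.create(path, true)) { // true = overwrite
            out.write("first record\n".getBytes(StandardCharsets.UTF_8));
            // Flush so the bytes written so far become visible to concurrent readers.
            out.hflush();

            out.write("second record\n".getBytes(StandardCharsets.UTF_8));
            // Ask for the data to be synced to disk for durability.
            out.hsync();

            // Current position, i.e. bytes written through this stream so far.
            return out.getPos();
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses whatever file system fs.defaultFS names; without cluster
        // configuration on the classpath this is the local file system.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/fsdataoutputstream-demo.txt"); // hypothetical path
        System.out.println("position after writes: " + writeRecords(fs, path));
    }
}
```

The variable has to be declared as FSDataOutputStream here, because none of these three methods exist on plain OutputStream.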

None of this functionality is defined in the base stream classes, but it helps Hadoop internals and applications achieve the desired semantics and improve performance. Notable users include Hadoop job history tracking and HBase.

In general, it's good practice for application code to use the most abstract class possible to avoid tight coupling to a particular subclass. That likely explains the code samples using OutputStream. If the extra functionality of FSDataOutputStream is not needed, then there is no need to refer to it.
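
To make the point about abstraction concrete, here is a small sketch using only the JDK. The ReportWriter class and writeReport method are hypothetical names. Because the method accepts any OutputStream, an FSDataOutputStream obtained from Hadoop could be passed in just as well as the in-memory stream used here as a stand-in.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class ReportWriter {

    // Depends only on java.io.OutputStream, so any subclass can be passed in,
    // including an FSDataOutputStream opened on HDFS.
    static void writeReport(OutputStream out, String body) throws IOException {
        out.write(body.getBytes(StandardCharsets.UTF_8));
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for an HDFS stream in this sketch.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeReport(buffer, "hello hdfs\n");
        System.out.println(buffer.size() + " bytes written");
    }
}
```

If writeReport later needed hflush() or getPos(), its parameter type would have to change to FSDataOutputStream, which is the coupling the advice above tries to avoid until it is actually needed.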

Upvotes: 2
