Kris Dimitrov

Reputation: 480

HDFS buffered write/read operations

I am using the HDFS Java API with FSDataOutputStream and FSDataInputStream to write/read files on a Hadoop 2.6.0 cluster of 4 machines.

The FS stream implementations have a bufferSize constructor parameter, which I assume controls the stream's internal cache. But it seems to have no effect at all on write/read speed, regardless of its value (I tried values from 8 KB up to several megabytes).

I was wondering if there is some way to achieve buffered writes/reads to the HDFS cluster other than wrapping the FSDataOutput/Input streams in BufferedOutput/Input streams?
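For context, a minimal sketch of the two approaches I am referring to (the namenode URI, paths, and buffer sizes are hypothetical placeholders):

    import java.io.BufferedOutputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBufferSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // hypothetical namenode address
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // 1) Pass a bufferSize (here 64 KB) directly to create()/open()
            Path p = new Path("/tmp/buffer-test.bin");
            try (FSDataOutputStream out = fs.create(p, true, 64 * 1024)) {
                out.write(new byte[1024]);
            }
            try (FSDataInputStream in = fs.open(p, 64 * 1024)) {
                in.read(new byte[1024]);
            }

            // 2) Wrap the HDFS stream in a plain java.io buffered stream
            Path p2 = new Path("/tmp/buffer-test-2.bin");
            try (BufferedOutputStream out =
                     new BufferedOutputStream(fs.create(p2, true), 64 * 1024)) {
                out.write(new byte[1024]);
            }
        }
    }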

Upvotes: 3

Views: 4398

Answers (1)

Kris Dimitrov

Reputation: 480

I have found the answer.

The bufferSize parameter of FileSystem.create() is actually io.file.buffer.size, which, according to the documentation, is:

"The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations."

The book "Hadoop: The Definitive Guide" suggests 128 KB as a good starting point.
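To illustrate, a minimal sketch of both ways to apply the 128 KB value: setting io.file.buffer.size on the client Configuration, or passing it per stream as the bufferSize argument of create() (the path is a hypothetical placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BufferSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Client-wide default buffer: 128 KB, per the book's suggestion
            conf.setInt("io.file.buffer.size", 128 * 1024);
            FileSystem fs = FileSystem.get(conf);

            // Per-stream override: the bufferSize argument of create() serves the same role
            try (FSDataOutputStream out =
                     fs.create(new Path("/tmp/example.bin"), true, 128 * 1024)) {
                out.write(new byte[4096]);
            }
        }
    }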

As for the client-side internal cache: Hadoop transmits data in packets (64 KB by default). This can be tuned with the dfs.client-write-packet-size option in the hdfs-site.xml configuration. For my purposes I used 4 MB.
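A minimal sketch of that tuning, assuming the property can also be set programmatically on the client Configuration rather than by editing hdfs-site.xml (the 4 MB value matches the one above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class PacketSizeSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Write packets of 4 MB instead of the 64 KB default
            conf.setInt("dfs.client-write-packet-size", 4 * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            // subsequent writes through fs should use the larger packet size
        }
    }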

Upvotes: 5
