Reputation: 480
I am using the HDFS Java API with FSDataOutputStream and FSDataInputStream to write/read files to/from a Hadoop 2.6.0 cluster of 4 machines.
The FS stream implementations take a bufferSize constructor parameter, which I assume controls the stream's internal cache. However, it seems to have no effect at all on the write/read speed, regardless of its value (I tried values from 8 KB up to several megabytes).
I was wondering whether there is some way to achieve buffered writes/reads to an HDFS cluster, other than wrapping the FSDataOutputStream/FSDataInputStream in BufferedOutputStream/BufferedInputStream.
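For reference, this is roughly how I obtain the streams (a minimal sketch; the path, payload, and buffer value are just placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBufferTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/tmp/buffer-test.dat"); // placeholder path

            int bufferSize = 64 * 1024;        // the bufferSize parameter in question
            byte[] chunk = new byte[8 * 1024]; // dummy payload

            // write: bufferSize goes straight into FileSystem.create()
            FSDataOutputStream out = fs.create(path, true, bufferSize);
            for (int i = 0; i < 1024; i++) {
                out.write(chunk);
            }
            out.close();

            // read: same parameter on FileSystem.open()
            FSDataInputStream in = fs.open(path, bufferSize);
            while (in.read(chunk) != -1) {
                // discard; only the throughput is of interest
            }
            in.close();
        }
    }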
Upvotes: 3
Views: 4398
Reputation: 480
I have found the answer.
The bufferSize parameter of FileSystem.create() is actually io.file.buffer.size, which the documentation describes as:
"The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations."
According to the book "Hadoop: The Definitive Guide", a good starting point is to set it to 128 KB.
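So, if anyone needs it, here is a minimal sketch of applying that 128 KB starting point, either globally on the Configuration or per stream through the bufferSize argument (the path is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BufferSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // default buffer size for every stream opened through this FileSystem
            conf.setInt("io.file.buffer.size", 128 * 1024);
            FileSystem fs = FileSystem.get(conf);

            // or override it per stream via the bufferSize argument of create()/open()
            FSDataOutputStream out = fs.create(new Path("/tmp/out.dat"), true, 128 * 1024);
            out.writeUTF("hello");
            out.close();
        }
    }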
As for the internal cache on the client side: Hadoop transmits data in packets (64 KB by default). This can be tuned with the dfs.client-write-packet-size option in the hdfs-site.xml configuration. For my purposes I set it to 4 MB.
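The corresponding hdfs-site.xml entry on the client side would look roughly like this (4 MB expressed in bytes):

    <property>
      <name>dfs.client-write-packet-size</name>
      <value>4194304</value> <!-- 4 MB; the default is 65536 (64 KB) -->
    </property>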
Upvotes: 5