Reputation: 33
I have a Java program that reads files and writes their contents to new files using the HDFS data input/output streams. My goal is to find out the I/O throughput of my HDFS. Below is the code fragment that does the read/write and timing:
long start = System.currentTimeMillis();
byte[] buffer = new byte[4096];   // read buffer (size assumed; declaration not shown in the original)
long data = 0;                    // total bytes copied
int bytesRead;
FSDataInputStream in = fs.open(new Path(input));
FSDataOutputStream out = fs.create(new Path(output), true);
while ((bytesRead = in.read(buffer)) > 0) {
    out.write(buffer, 0, bytesRead);
    data += bytesRead;
}
in.close();
out.close();
long end = System.currentTimeMillis();
System.out.println("Copy data " + data + " Bytes in " +
        (double) (end - start) + " millisecond");
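From the printed values, the throughput itself can be derived like this (a minimal sketch reusing data, start, and end from the snippet above):

// Sketch: convert the measured bytes and milliseconds into MB/s
double seconds = (end - start) / 1000.0;
double mbPerSec = (data / (1024.0 * 1024.0)) / seconds;
System.out.println("Throughput: " + mbPerSec + " MB/s");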
I expected the time to copy a file to be proportional to the file size, but when I ran the program on files from 5 MB to 50 MB, the results did not show this correlation:
Copy data 5242880 Bytes in 844.0 millisecond
Copy data 10485760 Bytes in 733.0 millisecond
Copy data 15728640 Bytes in 901.0 millisecond
Copy data 20971520 Bytes in 1278.0 millisecond
Copy data 26214400 Bytes in 1304.0 millisecond
Copy data 31457280 Bytes in 1543.0 millisecond
Copy data 36700160 Bytes in 2091.0 millisecond
Copy data 41943040 Bytes in 1934.0 millisecond
Copy data 47185920 Bytes in 1847.0 millisecond
Copy data 52428800 Bytes in 3222.0 millisecond
My question is: why is the copy time not proportional to the file size? Am I using the wrong methods? Any feedback would be appreciated.
My Hadoop runs in pseudo-distributed mode, and I clear the OS page cache with the command:
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
every time before running the program.
Upvotes: 3
Views: 3119
Reputation: 7507
File copy time is affected by many factors, including:
1) file size
2) network latency and transfer speed
3) hard drive seek and read/write times
4) HDFS replication factor
When you are working with small files (and your 5 MB through 50 MB files are small), latency and seek times set a lower bound on the copy time; on top of that come the transfer speed and read/write times. Essentially, don't expect to see a linear time increase unless you start working with significantly larger files. HDFS is built around large blocks; I think the default is 64 MB, and people often raise it to 512 MB or larger.
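You can see this in your own numbers by fitting them to the model time = overhead + cost_per_MB * size. The sketch below (not from the original post) does an ordinary least-squares fit over the timings you listed, which makes the fixed per-copy overhead visible next to the marginal cost per megabyte:

// Sketch: least-squares fit of the posted timings to time = intercept + slope * sizeMb
public class FitCopyTimes {
    public static void main(String[] args) {
        // sizes in MB and measured times in ms, taken from the question
        double[] sizeMb = {5, 10, 15, 20, 25, 30, 35, 40, 45, 50};
        double[] timeMs = {844, 733, 901, 1278, 1304, 1543, 2091, 1934, 1847, 3222};

        int n = sizeMb.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX  += sizeMb[i];
            sumY  += timeMs[i];
            sumXY += sizeMb[i] * timeMs[i];
            sumXX += sizeMb[i] * sizeMb[i];
        }
        // slope = ms per MB (linear part), intercept = fixed overhead in ms
        double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double intercept = (sumY - slope * sumX) / n;

        System.out.printf("overhead ~ %.0f ms, marginal cost ~ %.1f ms/MB (~%.1f MB/s)%n",
                intercept, slope, 1000.0 / slope);
    }
}

The intercept is the part that doesn't scale with file size, which is why the small-file timings look so irregular.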
For testing the I/O times, try TestDFSIO and TestFileSystem. They are found in the hadoop-mapreduce-client-jobclient-*.jar.
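A write/read benchmark run could look roughly like this (the exact jar path and flags depend on your Hadoop version, so treat this invocation as an assumption):

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -write -nrFiles 4 -fileSize 128MB
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
    TestDFSIO -read -nrFiles 4 -fileSize 128MB

TestDFSIO prints an aggregate throughput figure at the end, which is a more reliable number than timing a single small copy.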
Upvotes: 2