Reputation: 405
I tried to compare the performance of writing a big file to the local filesystem versus HDFS, and the result confuses me: the elapsed time for the local write is much shorter than for the HDFS write. I don't understand the concept that "Hadoop is good for sequential data access"...
[root@datanodetest01 tmp]# dd if=/dev/zero of=testfile count=1 bs=256M
1+0 records in
1+0 records out
268435456 bytes (268 MB) copied, 0.324765 s, 827 MB/s
[root@datanodetest01 tmp]# time hadoop fs -put testfile /tmp
real 0m3.461s
user 0m6.829s
sys 0m0.666s
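For completeness, the read path could be timed the same way, keeping the local disk out of it by sending the output to /dev/null:
time hadoop fs -cat /tmp/testfile > /dev/null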
Upvotes: 1
Views: 2792
Reputation: 5538
HDFS comes with a plethora of benefits, which are listed here and here.
Please note that whether you store the data on a local disk or on HDFS, you ultimately want to do some processing on it. The whole Big Data technology stack builds on HDFS's characteristics to process data quickly and in a fault-tolerant manner.
The difference between copying data locally and copying it into HDFS can be attributed to the following facts (the commands after this list show how to check both on a running cluster):
1) HDFS makes three copies of each block by default, so the data stays highly available even if a machine in the cluster goes down.
2) Those copies are kept on different machines across the cluster, so every write involves network I/O.
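As a rough way to see both effects (a sketch, assuming the /tmp/testfile from the question and stock Hadoop 2.x shell commands):
hdfs getconf -confKey dfs.replication              # configured replication factor (3 by default)
hadoop fs -stat "%r %b %n" /tmp/testfile           # replication, length and name recorded for this file
hdfs fsck /tmp/testfile -files -blocks -locations  # which datanodes hold which block replicas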
Also note the following, from http://hadooptutorials.co.in/tutorials/hadoop/hadoop-fundamentals.html:
Hadoop uses blocks to store a file or parts of a file. A Hadoop block is a file on the underlying filesystem. Since the underlying filesystem stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system. Blocks are large. They default to 64 megabytes each and most systems run with block sizes of 128 megabytes or larger.
Hadoop is designed for streaming or sequential data access rather than random access. Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block and begins reading sequentially from there.
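To check the block size actually used for the file from the question (again only a sketch, same assumptions as above):
hdfs getconf -confKey dfs.blocksize    # cluster default block size in bytes
hadoop fs -stat "%o" /tmp/testfile     # block size recorded for this particular file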
A good further read: Hadoop sequential data access
Upvotes: 4
Reputation: 4499
Hadoop stores files as blocks (128 MB by default) and replicates these blocks onto the disks of different nodes.
When you read data from a spinning disk, the drive head has to move across the platter to the right location before it can read anything. If you read many small files scattered across the disk, the seek time needed to move the head between those locations adds significant overhead and reduces throughput.
If you read a single large file, the head seeks once and then reads continuously, so the total seek time is a small fraction of the read time.
This is what is meant by "Hadoop is efficient for sequential access"; it is not so good for random access, where seek time dominates.
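A rough back-of-the-envelope illustration (the numbers are only ballpark assumptions: ~10 ms average seek, ~100 MB/s sequential transfer):
one 128 MB block:                0.01 s seek + 1.28 s transfer -> seeking is under 1% of the total
1000 scattered files of ~128 KB: 1000 x 0.01 s = 10 s spent on seeks for roughly the same amount of data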
You should not compare Hadoop performance directly with a local write, since Hadoop incurs heavy network overhead transferring block replicas to and from different nodes.
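If you want to see how much of the gap comes from replication alone, one option (a sketch, not verified on your cluster; FsShell in Hadoop 2.x honours the -D generic option) is to time the same put with a single replica:
time hadoop fs -Ddfs.replication=1 -put testfile /tmp/testfile_rep1
The remaining difference is roughly the cost of writing one copy over the network instead of locally.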
Upvotes: 1