Pradeep Krishnaraj

Reputation: 99

HDFS Block Split

My Hadoop knowledge is 4 weeks old. I am using a Hadoop sandbox.

According to the theory, when a file is copied into HDFS, it is split into 128 MB blocks. Each block is stored on a data node and then replicated to other data nodes.
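To make the numbers concrete: with the default 128 MB block size, a ~500 MB file is split into ceil(500 / 128) = 4 blocks, i.e. three full 128 MB blocks plus one final block of roughly 116 MB (500 − 3 × 128). This is just an illustration assuming the default; the actual count depends on the configured dfs.blocksize.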

Question:

  1. When I copy a data file (~500 MB) from the local file system into HDFS (put command), the entire file still shows up as a single file (-ls command); the commands I ran are sketched after this list. I was expecting to see 128 MB blocks. What am I doing wrong here?

  2. Supposing the file is split and distributed across HDFS, is there a way to combine the blocks and retrieve the original file back to the local file system?
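For reference, a minimal reproduction of the steps in question 1; the file name datafile.csv and the HDFS path /user/me/ are placeholders, not the actual names used:

    # copy the ~500 MB file from the local file system into HDFS
    hdfs dfs -put datafile.csv /user/me/datafile.csv

    # list it back: this shows one logical ~500 MB file, not the underlying 128 MB blocks
    hdfs dfs -ls /user/me/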

Upvotes: 3

Views: 966

Answers (1)

Keegan

Reputation: 12247

  1. You won't see the individual blocks from the -ls command, just as the physical blocks of a file on a hard drive don't show up in Linux's ls or in Windows Explorer. You can inspect the blocks from the command line with hdfs fsck /user/me/someFile.avro -files -blocks -locations, or use the NameNode UI to see which hosts hold the blocks for a file and on which hosts each block is replicated (see the sketch after this list).
  2. Sure. You'd just do something like hdfs dfs -get /user/me/someFile.avro, or download the file using HUE or the NameNode UI. All of these options stream the appropriate blocks back to you and reassemble the logical file.
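A minimal sketch of both steps, reusing the placeholder path /user/me/someFile.avro from above (the exact layout of the fsck report varies by Hadoop version):

    # report the file's blocks, their IDs, and the data nodes holding each replica
    hdfs fsck /user/me/someFile.avro -files -blocks -locations

    # stream the blocks back and reassemble the logical file on the local file system
    hdfs dfs -get /user/me/someFile.avro ./someFile.avro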

Upvotes: 8
