Reputation: 155
I have installed a pseudo-distributed (single-node) Hadoop setup on Ubuntu, running inside VMware on my Windows 10 machine.
I downloaded a file from the internet and copied it into the Ubuntu local directory /lab/data.
I have created NameNode and DataNode folders (not the Hadoop folder) named namenodep and datan1 in Ubuntu. I have also created a folder /input inside HDFS.
When I copied the file from the Ubuntu local filesystem to HDFS, why is the file present in both of the directories below?
$ hadoop fs -copyFromLocal /lab/data/Civil_List_2014.csv /input
$ hadoop fs -ls /input/
input/Civil_List_2014.csv ?????
$ cd lab/hdfs/datan1/current
blk_3621390486220058643 ?????
blk_3621390486220058643_1121.meta
Basically, I want to understand: did it create two copies, one inside the datan1 folder and the other inside HDFS?
Thanks
Upvotes: 0
Views: 326
Reputation: 6343
No. Only one copy is created.
When you create a file in HDFS, the contents of the file are stored on one of the disks of the Data Node. The disk location where the Data Node stores the data is determined by the configuration parameter dfs.datanode.data.dir (present in hdfs-site.xml).
Check the description of this property:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///e:/hdpdatadn/dn</value>
  <description>Determines where on the local filesystem an DFS data node
    should store its blocks. If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
  </description>
  <final>true</final>
</property>
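In your case, dfs.datanode.data.dir in your hdfs-site.xml presumably points at the datan1 folder you created. A sketch of what that entry would look like (the exact path is an assumption, inferred from where your blk_* files showed up):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///lab/hdfs/datan1</value>
</property>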
So the contents of your HDFS file "/input/Civil_List_2014.csv" are stored in the physical location lab/hdfs/datan1/current/blk_3621390486220058643.
"blk_3621390486220058643_1121.meta" contains the check sum of the data stored in "blk_3621390486220058643".
This file may be small enough to fit in a single block. But if a file is big (assuming the file is larger than 256 MB and the Hadoop block size is 256 MB), then Hadoop splits the contents of the file into 'n' blocks and stores them on disk. In that case, you will see 'n' "blk_*" files in the Data Node's data directory.
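You can confirm this mapping yourself with the fsck tool, which prints the block IDs behind an HDFS path; a minimal check (the command is standard, only the file path is yours):

$ hadoop fsck /input/Civil_List_2014.csv -files -blocks -locations

For a small file like this one, it should report a single block whose ID matches the blk_* file you found under datan1/current.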
Also, since the replication factor is typically set to 3, three instances of the same block are created, each on a different Data Node.
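As an aside for your setup: a pseudo-distributed, single-node installation normally has the replication factor set to 1 in hdfs-site.xml (there is only one Data Node to hold replicas), which is why you see exactly one blk_* file. A typical entry looks like this:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>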
Upvotes: 1
Reputation: 6855
The output from the hadoop fs -ls /input/
command is actually showing you metadata information, not a physical file; it is a logical abstraction over the files hosted by the DataNodes. This metadata information is stored by the NameNode.
The actual physical files are split into blocks and are hosted by the DataNodes in the path specified in the configuration, in your case lab/hdfs/datan1/current.
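For example, a listing would look something like this (the owner, size and timestamp below are made up for illustration); every column (permissions, replication factor, owner, group, size, modification time and path) comes from the NameNode's metadata, not from the block files on disk:

$ hadoop fs -ls /input/
Found 1 items
-rw-r--r--   1 hduser supergroup     123456 2015-03-01 10:15 /input/Civil_List_2014.csv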
Upvotes: 1