Reputation: 155
I have installed a pseudo-distributed (single-node) Hadoop setup on Ubuntu, running inside VMware on my Windows 10 machine.
I downloaded a file from the internet and copied it into the Ubuntu local directory /lab/data.
I have created NameNode and DataNode folders (not the Hadoop folder) named namenodep and datan1 in Ubuntu. I have also created a folder /input inside HDFS.
When I copied the file from the Ubuntu local filesystem to HDFS, why is the file present in both of the directories below?
$ hadoop fs -copyFromLocal /lab/data/Civil_List_2014.csv /input
$ hadoop fs -ls /input/
input/Civil_List_2014.csv ?????
$ cd lab/hdfs/datan1/current
blk_3621390486220058643 ?????
blk_3621390486220058643_1121.meta
Basically, I want to understand: did it create two copies, one inside the datan1 folder and the other inside HDFS?
Thanks
Upvotes: 0
Views: 326
Reputation: 6343
No. Only one copy is created.
When you create a file in HDFS, the contents of the file are stored on one of the disks of the Data Node. The disk location where the Data Node stores the data is determined by the configuration parameter dfs.datanode.data.dir (present in hdfs-site.xml).
Check the description of this property:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///e:/hdpdatadn/dn</value>
  <description>Determines where on the local filesystem an DFS data node
    should store its blocks. If this is a comma-delimited
    list of directories, then data will be stored in all named
    directories, typically on different devices.
    Directories that do not exist are ignored.
  </description>
  <final>true</final>
</property>
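In your case, dfs.datanode.data.dir in your hdfs-site.xml presumably points at the datan1 folder you created. A sketch of what that entry would look like (the exact path is an assumption, inferred from where your blk_* files showed up):

<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///lab/hdfs/datan1</value>
</property>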
So the contents of your HDFS file "/input/Civil_List_2014.csv" are stored in the physical location lab/hdfs/datan1/current/blk_3621390486220058643.
"blk_3621390486220058643_1121.meta" contains the check sum of the data stored in "blk_3621390486220058643".
This file may be small enough to fit in a single block. But if a file is big (assuming the file is larger than 256 MB and the Hadoop block size is 256 MB), then Hadoop splits the contents of the file into 'n' blocks and stores them on disk. In that case, you will see 'n' "blk_*" files in the Data Node's data directory.
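You can confirm this mapping yourself with the fsck tool, which prints the block IDs behind an HDFS path; a minimal check (the command is standard, only the file path is yours):

$ hadoop fsck /input/Civil_List_2014.csv -files -blocks -locations

For a small file like this one, it should report a single block whose ID matches the blk_* file you found under datan1/current.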
Also, since the replication factor is typically set to 3, three instances of the same block are created, each on a different Data Node.
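As an aside for your setup: a pseudo-distributed, single-node installation normally has the replication factor set to 1 in hdfs-site.xml (there is only one Data Node to hold replicas), which is why you see exactly one blk_* file. A typical entry looks like this:

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>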
Upvotes: 1
Reputation: 6855
The output from the hadoop fs -ls /input/
command is actually showing you metadata information, not a physical file; it is a logical abstraction over the files hosted by the DataNodes. This metadata information is stored by the NameNode.
The actual physical files are split into blocks and are hosted by the DataNodes in the path specified in the configuration, in your case lab/hdfs/datan1/current.
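For example, a listing would look something like this (the owner, size and timestamp below are made up for illustration); every column (permissions, replication factor, owner, group, size, modification time and path) comes from the NameNode's metadata, not from the block files on disk:

$ hadoop fs -ls /input/
Found 1 items
-rw-r--r--   1 hduser supergroup     123456 2015-03-01 10:15 /input/Civil_List_2014.csv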
Upvotes: 1