sulabhc

Reputation: 666

Does Hadoop distcp copy replicas?

If I use distcp to copy data between two clusters, does it copy all replicas, or does it copy just one replica of the data and replicate it on the new cluster?

For example, say I copy 3 GB of data from a cluster with a replication factor (RF) of 3. Will distcp copy the full 3 GB, or does it know that, since the RF is 3, it only needs to move 1 GB (one copy) of the data, and then replicate it on the destination cluster according to that cluster's RF?

Upvotes: 0

Views: 1626

Answers (2)

Sreeja Sreenivasan

Reputation: 21

When you copy with distcp, only the actual data (that is, one copy of it) is transferred. Replication on the destination is handled by the framework, just as it is when fresh data is written to HDFS. In addition, when running distcp between two clusters, you can specify whether to preserve the source's replication factor (the `-p` option with the `r` flag, i.e. `-pr`).

For more information:
https://hadoop.apache.org/docs/stable1/distcp.html

Upvotes: 1

harpun

Reputation: 4110

What matters is the raw data size. If the raw data is 1 GB, it occupies up to 3 x 1 GB = 3 GB on disk with a replication factor of 3. When you copy from one cluster to another, only the raw 1 GB of data is copied to the destination cluster.

HDFS handles block replication internally. As the copied files are written, the destination cluster replicates each block up to its RF, and any blocks that end up under-replicated (i.e. with fewer replicas than the RF) are re-replicated automatically.
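
The size arithmetic above can be sketched in a few lines of Python. This is only an illustration of the numbers, not a call into any Hadoop API: the function `distcp_sizes` and its parameter names are hypothetical, and it assumes the destination stores the data at its own replication factor (`dest_rf`) unless the source's factor is preserved.

```python
def distcp_sizes(logical_gb: float, source_rf: int, dest_rf: int) -> dict:
    """Illustrative arithmetic only; hypothetical helper, not a Hadoop API."""
    return {
        # Disk space used on the source: every block is stored source_rf times.
        "source_disk_gb": logical_gb * source_rf,
        # Data sent between clusters: one copy of the raw data.
        "transferred_gb": logical_gb,
        # Disk space used on the destination after HDFS replicates the blocks.
        "dest_disk_gb": logical_gb * dest_rf,
    }


# 1 GB of raw data, RF = 3 on both clusters (the case discussed above).
sizes = distcp_sizes(logical_gb=1, source_rf=3, dest_rf=3)
print(sizes)
```

With these inputs, the source and destination each hold 3 GB on disk, while only 1 GB crosses between the clusters. Passing a different `dest_rf` shows how the destination's footprint changes when its replication factor differs from the source's.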

Upvotes: 4
