Reputation: 12938
I am new to Apache Hadoop. We have one Hadoop cluster [1] filled with data, and another Hadoop cluster [2] that is empty. What is the simplest and most preferred way to replicate the data from [1] into [2]?
Upvotes: 0
Views: 2040
Reputation: 1496
You can use DistCp (distributed copy). It is a tool that lets you copy data between clusters, or to/from a different file system such as S3 or an FTP server.
https://hadoop.apache.org/docs/r1.2.1/distcp2.html
You must specify an absolute path when copying data from the external cluster: hdfs://OtherClusterNN:port/path
The tool launches a MapReduce job that copies data in parallel from any kind of source available in the Hadoop FileSystem library, such as HDFS, FTP, S3, or Azure (in recent versions).
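A minimal invocation might look like the following. The hostnames and paths are placeholders for illustration; 8020 is a common default NameNode RPC port, but check your cluster's configuration. The `-update` and `-p` flags are standard DistCp options for incremental copies and preserving file attributes:

```shell
# Copy /user/data from cluster 1 to cluster 2 (run from either cluster).
# Hostnames, port, and paths below are examples; substitute your own.
hadoop distcp \
  -update \              # skip files that already exist with the same size/checksum
  -p \                   # preserve permissions, ownership, and timestamps
  hdfs://nn1.cluster1.example.com:8020/user/data \
  hdfs://nn2.cluster2.example.com:8020/user/data
```

Since DistCp runs as a MapReduce job, it scales with the number of map tasks; you can cap parallelism with `-m <num_maps>` if the copy saturates the network.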
To copy data between different versions of Hadoop, you cannot use the HDFS protocol on both sides; instead, use HftpFileSystem to read from one of them.
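For the cross-version case, a hedged sketch: HFTP is read-only, so the job must run on the destination cluster, reading from the source over `hftp://`. The hostnames are placeholders, and 50070 is the common default NameNode HTTP port, which HFTP uses:

```shell
# Run on the DESTINATION cluster: HFTP is read-only, so the source
# is addressed with hftp:// (HTTP port, commonly 50070) and the
# destination with the local cluster's hdfs:// scheme.
hadoop distcp \
  hftp://nn1.old-cluster.example.com:50070/user/data \
  hdfs://nn2.new-cluster.example.com:8020/user/data
```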
Upvotes: 5