Anand Kamathi

Reputation: 331

Hadoop backup and recovery tool and guidance

I am new to Hadoop and need to learn the details of backup and recovery. I have studied Oracle backup and recovery; will that help with Hadoop? Where should I start?

Upvotes: 6

Views: 5388

Answers (5)

Andrew I.

Reputation: 1

Hadoop backup and recovery may have some similarities with Oracle, but it also has its own unique aspects due to the distributed and fault-tolerant nature of Hadoop clusters.

HDFS snapshots, as mentioned earlier by Brandon, are useful for creating point-in-time copies of your data. Additionally, Hadoop supports incremental backups, which reduce data-transfer overhead and storage requirements. This matters especially for infrastructures holding large volumes of data, such as HPC environments.
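As a quick sketch of how snapshotting looks in practice (the directory path and snapshot name here are illustrative, not from the question):

    # Allow snapshots on a directory (admin command), then take one
    hdfs dfsadmin -allowSnapshot /data/warehouse
    hdfs dfs -createSnapshot /data/warehouse before-nightly-load

    # Snapshots show up under a read-only .snapshot directory
    hdfs dfs -ls /data/warehouse/.snapshot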

I agree that data replication is not a complete DR solution; however, it is a crucial aspect of data durability in Hadoop. By default, HDFS replicates data blocks across multiple nodes for fault tolerance. The replication factor can be adjusted to balance data durability against storage space.
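For example, the replication factor of an existing path can be raised from the command line (the path and value are illustrative):

    # Set replication to 5 for everything under /critical and wait until re-replication finishes
    hdfs dfs -setrep -w 5 /critical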

It is also worth elaborating on the importance of off-site backups for true DR. I would suggest options such as cross-cluster replication (syncing data to a separate Hadoop cluster in a different geographical location) and cloud storage services like Amazon S3, Azure Blob Storage, or Google Cloud Storage for storing backups. This ensures data availability even in the event of a catastrophic cluster failure or data center outage. Use tools like distcp to copy data between clusters and platforms; it can also be configured for incremental backups (see the sketch below). Apache NiFi can facilitate data flow between Hadoop and other data stores, including cloud environments, and Apache Ambari can assist in managing cluster configurations, which is crucial for restoring a Hadoop environment after a failure.
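A minimal distcp sketch, assuming a source cluster, a backup cluster, and an S3 bucket (all of the host and bucket names are hypothetical):

    # Full copy to a second cluster
    hadoop distcp hdfs://active-nn:8020/data hdfs://backup-nn:8020/data

    # Incremental sync to S3: copy only changed files and remove files deleted at the source
    hadoop distcp -update -delete hdfs://active-nn:8020/data s3a://my-backup-bucket/data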

Upvotes: 0

Kumar

Reputation: 4010

Hadoop is designed to run on large clusters with thousands of nodes, so the chance of data loss is low. You can increase the replication factor to replicate the data onto more nodes across the cluster.

Refer to Data Replication
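As an illustration, the default replication factor for newly written files is set in hdfs-site.xml (the value shown is just an example):

    <!-- hdfs-site.xml: default block replication for new files -->
    <property>
      <name>dfs.replication</name>
      <value>5</value>
    </property>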

For NameNode metadata backup, you can use either the Secondary NameNode or Hadoop High Availability.

Secondary Namenode

The Secondary NameNode takes periodic checkpoints of the NameNode's metadata. If the NameNode fails, you can recover the namespace metadata (which holds the data-block information) from the Secondary NameNode.
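A sketch of one recovery path, assuming the checkpoint written by the Secondary NameNode is still available in the configured checkpoint directory:

    # On a freshly set up NameNode host with an empty dfs.namenode.name.dir,
    # load the most recent checkpoint saved by the Secondary NameNode
    hdfs namenode -importCheckpoint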

High Availability

High Availability is a newer feature that runs more than one NameNode in the cluster. One NameNode is active and the other is on standby, and the edit log is saved on both. If the active NameNode fails, the standby becomes active and handles operations.
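For example, with an HA pair whose NameNode IDs are nn1 and nn2 (the IDs come from your hdfs-site.xml, so these names are assumptions):

    # Check which NameNode is active and which is standby
    hdfs haadmin -getServiceState nn1
    hdfs haadmin -getServiceState nn2

    # Trigger a manual failover from nn1 to nn2
    hdfs haadmin -failover nn1 nn2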

But in most cases we also need to plan for backup and disaster recovery. Refer to @brandon.bell's answer.

Upvotes: 2

Ravindra babu

Reputation: 38910

Start with the official documentation website: HdfsUserGuide

Have a look at the SE posts below:

Hadoop 2.0 data write operation acknowledgement

Hadoop: HDFS File Writes & Reads

Hadoop 2.0 Name Node, Secondary Node and Checkpoint node for High Availability

How does Hadoop Namenode failover process works?

Documentation page regarding Recovery_Mode:

Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.

However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.

You can start the NameNode in recovery mode like so: namenode -recover
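On recent Hadoop distributions the command is run through the hdfs script; per the same documentation, -force makes recovery mode pick the first (default) choice at each prompt instead of asking:

    # Interactive recovery
    hdfs namenode -recover

    # Non-interactive: always select the default choice at each prompt
    hdfs namenode -recover -force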

Upvotes: 0

ashwin111

Reputation: 146

You can use the HDFS sync application on DataTorrent for DR use cases to back up high volumes of data from one HDFS cluster to another.

https://www.datatorrent.com/apphub/hdfs-sync/

It uses Apache Apex as a processing engine.

Upvotes: 0

brandon.bell

Reputation: 1411

There are a few options for backup and recovery. As s.singh points out, data replication is not DR.

HDFS supports snapshotting. This can be used to protect against user errors, recover files, etc. That being said, this isn't DR in the event of a total failure of the Hadoop cluster. (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html)
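For instance, restoring an accidentally deleted file is just a copy out of the read-only .snapshot directory (the path and snapshot name are illustrative):

    # Copy a file back from a snapshot taken earlier
    hdfs dfs -cp /data/warehouse/.snapshot/before-nightly-load/events.csv /data/warehouse/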

Your best bet is keeping off-site backups. This can be to another Hadoop cluster, S3, etc., and can be performed using distcp. (http://hadoop.apache.org/docs/stable1/distcp2.html), (https://wiki.apache.org/hadoop/AmazonS3)
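distcp can also do snapshot-based incremental copies; a sketch, assuming snapshots s1 and s2 exist on the source path and s1 already exists on the target (host names hypothetical):

    # Copy only what changed between snapshots s1 and s2
    hadoop distcp -update -diff s1 s2 hdfs://active-nn:8020/data hdfs://backup-nn:8020/data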

Here is a Slideshare by Cloudera discussing DR (http://www.slideshare.net/cloudera/hadoop-backup-and-disaster-recovery)

Upvotes: 6
