Reputation: 331
I am new to Hadoop and need to learn the details of backup and recovery. I have studied Oracle backup and recovery; will that help with Hadoop? Where should I start?
Upvotes: 6
Views: 5388
Reputation: 1
Hadoop backup and recovery may have some similarities with Oracle, but it also has its own unique aspects due to the distributed and fault-tolerant nature of Hadoop clusters.
HDFS snapshots, as mentioned earlier by Brandon, are useful for creating point-in-time copies of your data. Additionally, Hadoop supports incremental backups, which reduce the data transfer overhead and storage requirements of each backup run. This matters most for infrastructures holding large volumes of data, such as HPC environments, where repeated full backups are impractical.
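As a minimal sketch (the directory path and snapshot name below are placeholders), snapshots are enabled per directory by an administrator and then created on demand:

```sh
# Allow snapshots on a directory (run as an HDFS administrator)
hdfs dfsadmin -allowSnapshot /data/warehouse

# Create a point-in-time snapshot; it appears under /data/warehouse/.snapshot/
hdfs dfs -createSnapshot /data/warehouse backup-2024-01-01

# Recover a single file by copying it out of the read-only snapshot
hdfs dfs -cp /data/warehouse/.snapshot/backup-2024-01-01/part-00000 /data/warehouse/
```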
I agree that data replication is not a complete DR solution; however, it is a crucial aspect of data durability in Hadoop. By default, HDFS replicates each data block across multiple nodes for fault tolerance. The replication factor can be adjusted to balance data durability against storage space.
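For instance (the path and factor here are illustrative), the replication factor can be changed per file or directory tree:

```sh
# Raise the replication factor of /data/critical to 5; applied recursively
# to files under a directory. -w waits until re-replication completes.
hdfs dfs -setrep -w 5 /data/critical
```

The cluster-wide default comes from dfs.replication in hdfs-site.xml (3 by default).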
It is also worth emphasizing the importance of off-site backups for true DR. I would suggest options such as cross-cluster replication (syncing data to a separate Hadoop cluster in a different geographical location) and using cloud storage services like Amazon S3, Azure Blob Storage, or Google Cloud Storage to store backups. This ensures data availability even in the event of a catastrophic cluster failure or data center outage. Tools like distcp can copy data between clusters and platforms, and can be configured for incremental backups (see the sketch below). Apache NiFi can facilitate data flow between Hadoop and other data stores, including cloud environments, and Apache Ambari can assist in managing cluster configurations, which is crucial for restoring a Hadoop environment after a failure.
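A minimal sketch of an incremental cross-cluster copy (the NameNode hostnames and paths are placeholders for your own clusters):

```sh
# Copy /data from the active cluster to the backup cluster.
# -update transfers only files that changed since the last run;
# -delete removes target files that no longer exist on the source.
hadoop distcp -update -delete hdfs://active-nn:8020/data hdfs://backup-nn:8020/data
```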
Upvotes: 0
Reputation: 4010
Hadoop is designed to run on large clusters with thousands of nodes, so the likelihood of data loss is low. You can increase the replication factor to replicate the data onto many nodes across the cluster.
Refer to Data Replication.
For NameNode edit-log backup, you can use either the Secondary NameNode or Hadoop High Availability.
Secondary Namenode
The Secondary NameNode takes periodic backups of the NameNode's edit logs. If the NameNode fails, you can recover the logs (which hold the data block information) from the Secondary NameNode.
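As a rough sketch of that recovery path (this assumes dfs.namenode.checkpoint.dir on the recovering host contains the checkpoint written by the Secondary NameNode):

```sh
# On the replacement NameNode host, with an empty dfs.namenode.name.dir:
# load the most recent checkpoint saved by the Secondary NameNode
hdfs namenode -importCheckpoint
```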
High Availability
High Availability is a newer feature that runs more than one NameNode in the cluster. One NameNode is active while the other stays in standby, and the edit log is saved to both. If the active NameNode fails, the standby becomes active and handles all operations.
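A minimal sketch of checking state and failing over with the hdfs haadmin tool (nn1 and nn2 are hypothetical NameNode IDs taken from dfs.ha.namenodes.&lt;nameservice&gt; in hdfs-site.xml):

```sh
# Check which NameNode is currently active
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually fail over from nn1 to nn2 (e.g. before planned maintenance)
hdfs haadmin -failover nn1 nn2
```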
But in most cases we also need to plan for Backup and Disaster Recovery. Refer to @brandon.bell's answer.
Upvotes: 2
Reputation: 38910
Start with the official documentation: HdfsUserGuide
Have a look at the SE posts below:
Hadoop 2.0 data write operation acknowledgement
Hadoop: HDFS File Writes & Reads
Hadoop 2.0 Name Node, Secondary Node and Checkpoint node for High Availability
How does Hadoop Namenode failover process works?
Documentation page regarding Recovery_Mode:
Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.
However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.
You can start the NameNode in recovery mode like so: namenode -recover
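In practice this is invoked through the hdfs launcher (run it while the NameNode daemon is stopped; recovery mode is interactive and may discard corrupt edit-log entries):

```sh
# Start the NameNode in recovery mode; you are prompted before any
# corrupt edit-log entries are discarded
hdfs namenode -recover

# Per the HDFS user guide, -force always takes the first choice at each prompt
hdfs namenode -recover -force
```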
Upvotes: 0
Reputation: 146
You can use the HDFS Sync application on DataTorrent for DR use cases to back up high volumes of data from one HDFS cluster to another.
https://www.datatorrent.com/apphub/hdfs-sync/
It uses Apache Apex as a processing engine.
Upvotes: 0
Reputation: 1411
There are a few options for backup and recovery. As s.singh points out, data replication is not DR.
HDFS supports snapshotting. Snapshots can be used to protect against user errors, recover deleted files, etc. That said, they are not DR in the event of a total failure of the Hadoop cluster. (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html)
Your best bet is keeping off-site backups. These can go to another Hadoop cluster, S3, etc., and can be performed using distcp, as sketched below. (http://hadoop.apache.org/docs/stable1/distcp2.html), (https://wiki.apache.org/hadoop/AmazonS3)
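A minimal sketch of an off-site copy to S3 (the bucket name and paths are hypothetical, and this assumes the s3a connector and AWS credentials are configured on the cluster):

```sh
# Push an HDFS directory to S3 as an off-site backup
hadoop distcp /data/warehouse s3a://my-backup-bucket/warehouse

# Pull it back into HDFS during recovery
hadoop distcp s3a://my-backup-bucket/warehouse /data/warehouse
```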
Here is a SlideShare presentation by Cloudera discussing DR: (http://www.slideshare.net/cloudera/hadoop-backup-and-disaster-recovery)
Upvotes: 6