twid

Reputation: 6686

org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop/dfs/name is in an inconsistent state

I am running a single-node cluster. The NameNode always fails to start when I start the cluster, with the following error:

    2013-06-29 10:37:29,968 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Exception in namenode join
    org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /tmp/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverStorageDirs(FSImage.java:292)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:200)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:627)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:469)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:403)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:437)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:609)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:594)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1169)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1235)
    2013-06-29 10:37:29,971 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
    2013-06-29 10:37:29,973 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG: 
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at traw-pc/127.0.0.1
    ************************************************************/

I know there is a similar question, and the error can be resolved by formatting the NameNode. But my question is: why do I get this error every time? It is not a big concern since I am running a single-node cluster, but in a real production environment this could cause data loss. My guess is that it happens because I am using the /tmp directory.
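For reference, a quick way to confirm where the NameNode metadata directory actually points is the standard hdfs getconf command (available in Hadoop 2.x):

    # prints the directory the NameNode uses for its fsimage/edits
    hdfs getconf -confKey dfs.namenode.name.dir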

Upvotes: 4

Views: 4376

Answers (2)

Carlos Saltos

Reputation: 1511

The error is due to a missing or corrupted HDFS metadata directory. As Arafath mentions, this most likely happens because the default location is under /tmp, which is cleared at reboot time.

To fix this, add or change the property dfs.name.dir (dfs.namenode.name.dir in Hadoop 2.x) in etc/hadoop/hdfs-site.xml to point at a persistent location, for example file:///opt/dfs-data.
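A minimal hdfs-site.xml entry would look like this (the path /opt/dfs-data is only an example; pick any directory that survives reboots and is writable by the user running HDFS):

    <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///opt/dfs-data</value>
    </property>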

To recover from this error, you need to restore from an fsimage backup of the NameNode ... recovering a backup of large volumes can take days or even weeks, so it is highly recommended to save checkpoints from time to time so that restoring from backup is faster.

If you are using a local demo Hadoop installation, you can just format and start again with these commands:

    $HADOOP_HOME/bin/hdfs namenode -format

    $HADOOP_HOME/sbin/hadoop-daemon.sh start namenode

IMPORTANT: these commands will delete any previous content, so obviously try the backup recovery first.

To save regular checkpoints once the NameNode is healthy again, the commands are:

    hdfs dfsadmin -safemode enter

    hdfs dfsadmin -saveNamespace

    hdfs dfsadmin -safemode leave

The cluster needs to be in safe mode (effectively offline for writes) while the checkpoint is saved. Checkpointing is rather fast, so a couple of minutes of maintenance every month should be more than enough.

Upvotes: 0

Arafath

Reputation: 11

This can be resolved by pointing your NameNode directory to a different location in hdfs-site.xml in your Hadoop configuration. By default it uses file://${hadoop.tmp.dir}/dfs/name, which lives under /tmp, so after every reboot the /tmp directory is cleared and the NameNode data is gone.
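Since the default is derived from hadoop.tmp.dir, another option is to relocate that base directory in core-site.xml. A minimal sketch (the path /opt/hadoop-data is only an example and must be writable by the HDFS user):

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/opt/hadoop-data</value>
    </property>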

Upvotes: 1
