HZhang

Reputation: 21

How do I configure Hadoop such that each datanode uses a different directory for storage?

All of the datanodes share one storage space. I'd like datanode1 to use dir1 and datanode2 to use dir2. At first I configured all the datanodes to use the same directory on the shared storage, and it turned out that only one datanode was running.

Upvotes: 1

Views: 2022

Answers (3)

kuldeep mishra

Reputation: 154

You can have the datanodes and namenodes share common storage by creating soft links as shown below.

host1:

lrwxrwxrwx  1 user user   39 Dec  2 17:30 /hadoop/hdfs/datanode -> /shared_storage/datanode1/
lrwxrwxrwx  1 user user   39 Dec  2 17:31 /hadoop/hdfs/namenode -> /shared_storage/namenode1/

host2:

lrwxrwxrwx 1 user user   39 Dec  2 17:32 /hadoop/hdfs/datanode -> /shared_storage/datanode2/
lrwxrwxrwx 1 user user   39 Dec  2 17:32 /hadoop/hdfs/namenode -> /shared_storage/namenode2/

host3:

lrwxrwxrwx 1 user user   39 Dec  2 17:33 /hadoop/hdfs/datanode -> /shared_storage/datanode3/
lrwxrwxrwx 1 user user   39 Dec  2 17:32 /hadoop/hdfs/namenode -> /shared_storage/namenode3/

host4:

lrwxrwxrwx 1 user user   39 Dec  2 17:33 /hadoop/hdfs/datanode -> /shared_storage/datanode4/
lrwxrwxrwx 1 user user   39 Dec  2 17:33 /hadoop/hdfs/namenode -> /shared_storage/namenode4/
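
For reference, the soft links above can be created with something like the following (a minimal sketch; the paths follow the /shared_storage layout shown above, and only the trailing host number changes from host to host):

# on host1 (use datanode2/namenode2 on host2, and so on)
mkdir -p /shared_storage/datanode1 /shared_storage/namenode1 /hadoop/hdfs
ln -s /shared_storage/datanode1/ /hadoop/hdfs/datanode
ln -s /shared_storage/namenode1/ /hadoop/hdfs/namenode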

In hdfs-site.xml on each host:

  <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:///hadoop/hdfs/datanode</value>
  </property>

  <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:///hadoop/hdfs/namenode</value>
  </property>
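
With the symlinks in place, this same hdfs-site.xml can be shared by every host: the /hadoop/hdfs/datanode and /hadoop/hdfs/namenode paths are identical everywhere, but on each machine they resolve to a different directory on the shared storage.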

Upvotes: 0

Suresh Vadali

Reputation: 139

This may be a crude way of doing it, but this is how I customized the slaves.sh file on the namenode to get a different directory structure for each datanode:

Edit the ssh remote command executed for each slave in $HADOOP_HOME/bin/slaves.sh:

for slave in `cat "$HOSTLIST"|sed  "s/#.*$//;/^$/d"`; do
 # If the slave node is ap1001 (the first datanode),
 # then use a different directory path for the ssh command.
 if [ "$slave" == "ap1001" ]
 then
      input=`/bin/echo $"${@// /\\ }"`
      # If the command type is start-dfs (start the datanodes),
      # then construct the start command for remote execution on the datanode through ssh.
      /bin/echo "$input" | grep -qi start
      if [ $? -eq 0 ]
      then
          inputArg="cd /app2/configdata/hdp/hadoop-1.2.1 ; /app2/configdata/hdp/hadoop-1.2.1/bin/hadoop-daemon.sh --config /app2/configdata/hdp/hadoop-1.2.1/libexec/../conf start datanode"
      else
          # If the command type is stop-dfs (stop the datanodes),
          # then construct the stop command for remote execution on the datanode through ssh.
          inputArg="cd /app2/configdata/hdp/hadoop-1.2.1 ; /app2/configdata/hdp/hadoop-1.2.1/bin/hadoop-daemon.sh --config /app2/configdata/hdp/hadoop-1.2.1/libexec/../conf stop datanode"
      fi
      ssh $HADOOP_SSH_OPTS $slave $inputArg 2>&1 &
 else
      # Use default command for remaining slaves.
      ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
      2>&1 | sed "s/^/$slave: /" &
 fi
 if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
   sleep $HADOOP_SLAVE_SLEEP
 fi
done
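
With this change in place, the regular cluster scripts still drive everything: in Hadoop 1.2.1, start-dfs.sh and stop-dfs.sh reach slaves.sh via hadoop-daemons.sh, so the customized ssh command for ap1001 takes effect when you run, for example:

$HADOOP_HOME/bin/start-dfs.sh
$HADOOP_HOME/bin/stop-dfs.sh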

Upvotes: 0

Chris White

Reputation: 30089

You'll need a custom hdfs-site.xml file for each node in your cluster, with the data directory property (dfs.data.dir) configured appropriately. If you're currently using a shared directory for the Hadoop configuration as well, then you'll need to change how you're doing that too.

This is somewhat painful; I guess you could use some shell scripting to generate the files, or a tool such as Puppet or Chef. A rough sketch of the scripting route is below.
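
A minimal sketch of that idea: generate a node-specific hdfs-site.xml and copy it to each datanode. The host names (datanode1..4), data directories (/shared_storage/dirN), and conf path here are examples only, not your actual values; dfs.data.dir is the Hadoop 1.x property name.

#!/bin/bash
# Sketch: generate a per-node hdfs-site.xml with its own dfs.data.dir and copy it out.
# Host names, data directories, and the conf path are hypothetical placeholders.
CONF_DIR=/hadoop/conf   # hypothetical Hadoop conf directory on each node

for i in 1 2 3 4; do
  host="datanode$i"
  cat > "/tmp/hdfs-site.xml.$host" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/shared_storage/dir$i</value>
  </property>
</configuration>
EOF
  # Push the node-specific config to that node.
  scp "/tmp/hdfs-site.xml.$host" "$host:$CONF_DIR/hdfs-site.xml"
done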

A question back at you: why are you using NFS? You're somewhat defeating the point of data locality. Hadoop is designed to move your code to where the data is, not (as in your case) to move both the code and the data.

If you're using NFS because it's backed by some SAN array with data redundancy, then again you're making things difficult for yourself: HDFS will (if configured) manage data replication for you, assuming you have a big enough cluster and it's properly configured. In theory it should also cost less to use commodity hardware than to back the cluster with an expensive SAN (though that depends on your setup / situation, I guess).

Upvotes: 2
