Reputation: 21
How do I configure Hadoop such that each datanode uses a different directory for storage?
All of the datanodes share one storage space. I'd like datanode1 to use dir1 and datanode2 to use dir2. At first I configured all the datanodes to use the same directory on the shared storage, and it turned out that only one datanode was running.
Upvotes: 1
Views: 2022
Reputation: 154
You can have the datanodes and namenodes share common storage by creating soft links like the ones below:

host1:
lrwxrwxrwx 1 user user 39 Dec 2 17:30 /hadoop/hdfs/datanode -> /shared_storage/datanode1/
lrwxrwxrwx 1 user user 39 Dec 2 17:31 /hadoop/hdfs/namenode -> /shared_storage/namenode1/
host2:
lrwxrwxrwx 1 user user 39 Dec 2 17:32 /hadoop/hdfs/datanode -> /shared_storage/datanode2/
lrwxrwxrwx 1 user user 39 Dec 2 17:32 /hadoop/hdfs/namenode -> /shared_storage/namenode2/
host3:
lrwxrwxrwx 1 user user 39 Dec 2 17:33 /hadoop/hdfs/datanode -> /shared_storage/datanode3/
lrwxrwxrwx 1 user user 39 Dec 2 17:32 /hadoop/hdfs/namenode -> /shared_storage/namenode3/
host4:
lrwxrwxrwx 1 user user 39 Dec 2 17:33 /hadoop/hdfs/datanode -> /shared_storage/datanode4/
lrwxrwxrwx 1 user user 39 Dec 2 17:33 /hadoop/hdfs/namenode -> /shared_storage/namenode4/
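A minimal sketch of how such links could be created, using the paths from the listings above (repeat on each host with its own datanodeN/namenodeN directories; this is an illustration, not part of the original setup):

# On host1, for example
mkdir -p /shared_storage/datanode1 /shared_storage/namenode1 /hadoop/hdfs
ln -s /shared_storage/datanode1/ /hadoop/hdfs/datanode
ln -s /shared_storage/namenode1/ /hadoop/hdfs/namenode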
Then point hdfs-site.xml on every host at those local link paths:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///hadoop/hdfs/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///hadoop/hdfs/datanode</value>
</property>
Upvotes: 0
Reputation: 139
This may be a crude way of doing it, but here is how I customized the slaves.sh file on the namenode to get a different directory structure for each datanode:
Edit the ssh remote command executed for each datanode in $HADOOP_HOME/bin/slaves.sh:
for slave in `cat "$HOSTLIST"|sed "s/#.*$//;/^$/d"`; do
  # If the slave node is ap1001 (the first datanode),
  # then use a different directory path for the ssh command.
  if [ $slave == "ap1001" ]
  then
    input=`/bin/echo $"${@// /\\ }"` >/dev/null 2>&1
    # If the command type is start-dfs (start the datanodes),
    # then construct the start command for remote execution on the datanode through ssh.
    /bin/echo $input | grep -i start
    if [ $? -eq 0 ]
    then
      inputArg="cd /app2/configdata/hdp/hadoop-1.2.1 ; /app2/configdata/hdp/hadoop-1.2.1/bin/hadoop-daemon.sh --config /app2/configdata/hdp/hadoop-1.2.1/libexec/../conf start datanode"
    else
      # If the command type is stop-dfs (stop the datanodes),
      # then construct the stop command for remote execution on the datanode through ssh.
      inputArg="cd /app2/configdata/hdp/hadoop-1.2.1 ; /app2/configdata/hdp/hadoop-1.2.1/bin/hadoop-daemon.sh --config /app2/configdata/hdp/hadoop-1.2.1/libexec/../conf stop datanode"
    fi
    ssh $HADOOP_SSH_OPTS $slave $inputArg 2>&1 &
  else
    # Use the default command for the remaining slaves.
    ssh $HADOOP_SSH_OPTS $slave $"${@// /\\ }" \
      2>&1 | sed "s/^/$slave: /" &
  fi
  if [ "$HADOOP_SLAVE_SLEEP" != "" ]; then
    sleep $HADOOP_SLAVE_SLEEP
  fi
done
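With this change in place, start-dfs.sh and stop-dfs.sh on the master should still work as usual (in the stock Hadoop 1.2.1 scripts they reach slaves.sh via hadoop-daemons.sh); only ap1001 receives the hard-coded command above, while every other slave gets the default one.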
Upvotes: 0
Reputation: 30089
You'll need a custom hdfs-site.xml file for each node in your cluster, with the data directory property (dfs.data.dir) configured appropriately. If you're currently sharing a single directory for the Hadoop configuration too, you'll need to change how you're doing that as well.
Somewhat painful, I guess: you could use some shell scripting to generate the files (a sketch follows), or a tool such as Puppet or Chef.
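A rough sketch of that scripting approach (the template file, the @DATA_DIR@ token, the host names, and the paths here are placeholders, not something from your cluster):

# hdfs-site.xml.template contains the token @DATA_DIR@ inside the dfs.data.dir value.
for host in datanode1 datanode2 datanode3 datanode4; do
  # Substitute a per-host data directory and push the result to that host.
  sed "s|@DATA_DIR@|/data/${host}/hdfs|" hdfs-site.xml.template > hdfs-site.xml.${host}
  scp hdfs-site.xml.${host} ${host}:/opt/hadoop/conf/hdfs-site.xml   # adjust the conf path to your install
done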
A question back at you: why are you using NFS? You're somewhat defeating the point of data locality. Hadoop is designed to move your code to where the data is, not (as in your case) both the code and the data.
If you're using NFS because it's backed by a SAN array with data redundancy, then again you're making things difficult for yourself: HDFS will manage data replication for you, assuming you have a big enough cluster and it's properly configured. In theory it should also cost less to use commodity hardware than to back everything with an expensive SAN (though that depends on your setup and situation, I guess).
Upvotes: 2