Reputation: 4319
I have a few questions related to Hadoop, which we are planning to implement in our production environment.
We have a large cluster of machines, each of which is a server with a large amount of RAM and 8 cores. Each of the 40 machines collects around 60 GB of data every 5 minutes. These machines are spread across multiple locations around the world. One separate server machine will act as the NameNode in the Hadoop environment; the remaining 40 machines, which are the data collectors, will join the Hadoop cluster as DataNodes.
Since the data collection rate is high on every machine, I do not want my data travelling across servers or geographies. So here are my two requirements:
1) I want my 60 GB of data to be split into blocks but processed locally. For that I want to run multiple DataNode daemons on the same server. Is it possible to have multiple DataNode daemons running on the same server?
2) Is it possible to process the blocks on specified DataNodes?
Let me take an example to make my point clear. Say I have server machines A, B, C, D, ...
Machine A will have 60 GB of data every 5 minutes. Can I run multiple DataNode daemons on machine A? If so, can I tell my NameNode to send the blocks only to the DataNode daemons running on server A, and not to the other machines?
I do not need high availability of the data and do not require failover, so there is no need to replicate the data.
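To make that last point concrete: as far as I understand, the number of copies HDFS keeps per block is controlled by the standard dfs.replication property, so in my setup I would simply set it to 1. A rough sketch (the commands below are only illustrative):
# Keep a single copy of each block, i.e. no replication.
# The property goes inside the <configuration> element of hdfs-site.xml:
#
#   <property>
#     <name>dfs.replication</name>
#     <value>1</value>
#   </property>
#
# Files that already exist can be brought down to one replica afterwards:
hdfs dfs -setrep -w 1 /     # 'hadoop fs -setrep' on older releases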
Upvotes: 3
Views: 5655
Reputation: 1
DataNodes and NameNodes are just pieces of software meant to run on any commodity machine, so yes, it is possible, but it is rarely done in the real world. If you consider the risk of data becoming unavailable when a single server fails, you will see why DataNodes are normally spread across different servers.
In addition, the official Apache documentation mentions:
The architecture does not preclude running multiple DataNodes on the same machine but in a real deployment that is rarely the case.
Source: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#NameNode+and+DataNodes
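As a quick sanity check, the JDK's jps tool lists the Hadoop daemons (Java processes) running on a machine; in a typical deployment you will see at most one DataNode entry per host:
# Each running Hadoop daemon (NameNode, DataNode, ...) appears as one line.
jps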
Upvotes: 0
Reputation: 530
To start multiple DataNodes on a single node, proceed as follows:
1) Download the Hadoop binary or build it from the Hadoop source.
2) Prepare the Hadoop configuration to run on a single node (change the default Hadoop tmp dir location from /tmp to some other reliable location); a sketch of this change follows the list.
3) Add the run-additionalDN.sh script shown below to the $HADOOP_HOME/bin directory and chmod it to 744.
4) Format HDFS: bin/hadoop namenode -format (for Hadoop 0.20 and below) or bin/hdfs namenode -format (for 0.21 and later).
5) Start HDFS with bin/start-dfs.sh (this will start the NameNode and one DataNode); the NameNode web UI can be viewed at http://localhost:50070.
6) Start the additional DataNodes using bin/run-additionalDN.sh.
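A minimal sketch for step 2, assuming a pseudo-distributed single-node setup (the property names are the standard Hadoop keys; the paths and port are placeholders to adapt):
# In conf/core-site.xml (etc/hadoop/core-site.xml on newer releases),
# move hadoop.tmp.dir away from /tmp and point the filesystem at localhost:
#
#   <property>
#     <name>hadoop.tmp.dir</name>
#     <value>/hadoopTmp</value>             <!-- placeholder path -->
#   </property>
#   <property>
#     <name>fs.default.name</name>          <!-- fs.defaultFS on Hadoop 2.x+ -->
#     <value>hdfs://localhost:9000</value>
#   </property>
#
mkdir -p /hadoopTmp    # create the new tmp location before formatting HDFS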
run-additionalDN.sh
#!/bin/sh
# This is used for starting multiple datanodes on the same machine.
# Run it from the Hadoop install directory, just like 'bin/hadoop'.
# Usage: run-additionalDN.sh [start|stop] dnnumber [dnnumber ...]
# e.g.   run-additionalDN.sh start 2

# Prefix under which each additional DataNode keeps its data, logs and PID file.
# Replace the placeholder below with a real path before running.
DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/"
if [ -z "$DN_DIR_PREFIX" ]; then
    echo "$0: DN_DIR_PREFIX is not set. Set it to something like /hadoopTmp/dn"
    exit 1
fi

run_datanode () {
    DN=$2
    # Per-instance log and PID directories so the daemons do not clash.
    export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
    export HADOOP_PID_DIR=$HADOOP_LOG_DIR
    # Give each instance its own tmp dir and its own ports
    # (e.g. DN=2 listens on 50012, 50082 and 50022).
    DN_CONF_OPTS="-Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN \
        -Ddfs.datanode.address=0.0.0.0:5001$DN \
        -Ddfs.datanode.http.address=0.0.0.0:5008$DN \
        -Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
    bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift
for i in "$@"
do
    run_datanode "$cmd" "$i"
done
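For example, to bring up two extra DataNodes (the numbers 2 and 3 are arbitrary; they only select the port suffixes and directories) and check that they registered with the NameNode:
# Assumes the script above is saved as bin/run-additionalDN.sh, is executable,
# and DN_DIR_PREFIX points at a real directory.
cd $HADOOP_HOME
bin/run-additionalDN.sh start 2 3

# The NameNode report should now list the extra DataNodes:
bin/hdfs dfsadmin -report

# Stop them again when finished:
bin/run-additionalDN.sh stop 2 3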
I hope this will help you
Upvotes: 4