Reputation: 16489
I have a Hadoop cluster running remotely. I was able to go through the tutorial:
on my remote machine because there is a built-in Hadoop instance there. However, I wish to perform the same task from my local machine. Being new to Hadoop, I am not sure how to do this. I was wondering if I could run the program from my local machine and have the results sent back to it. I'm not sure how to log on to my remote machine and then run the MapReduce job.
This is the code I have on my remote machine:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set the queue before creating the Job: Job.getInstance() copies the
        // Configuration, so properties set on conf afterwards are ignored.
        conf.set("mapred.job.queue.name", "exp_dsa");
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Upvotes: 2
Views: 6634
Reputation: 331
Along with all the steps suggested by Serhiy, you also have to set up WinUtils as suggested in the article below (in case you are running Eclipse on Windows):
Spark 1.6 - Failed to locate the winutils binary in the hadoop binary path
Then set HADOOP_HOME as a system environment variable, pointing at the directory whose bin subdirectory contains winutils.exe.
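If you cannot (or prefer not to) change the system environment, here is a minimal sketch of my own, assuming winutils.exe has been unpacked to the placeholder path C:\hadoop\bin: set the equivalent hadoop.home.dir Java system property inside the program before the first Hadoop class is loaded.
public class WindowsHadoopHome {
    public static void main(String[] args) throws Exception {
        // hadoop.home.dir has the same effect as the HADOOP_HOME environment
        // variable; C:\hadoop is a placeholder and must contain bin\winutils.exe.
        System.setProperty("hadoop.home.dir", "C:\\hadoop");
        org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
        // Touching FileSystem verifies that the Hadoop client libraries initialize cleanly.
        System.out.println(org.apache.hadoop.fs.FileSystem.get(conf).getUri());
    }
}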
Upvotes: 0
Reputation: 989
I had the same challenge with Hadoop 2.7 and solved it by adding the configuration below (a fuller driver sketch follows the snippet).
conf.set("yarn.resourcemanager.address", "127.0.0.1:8032");
conf.set("mapreduce.framework.name", "yarn");
conf.set("fs.default.name", "hdfs://127.0.0.1:9000");
conf.set("mapreduce.job.jar",".\\target\\wc-mvn-0.0.1-SNAPSHOT.jar");
Upvotes: 2
Reputation: 4141
I know it is a bit late for you, but other people could definitely profit from my answer, since I was looking for a very similar setup and for a way to run jobs remotely (even from Eclipse).
First, let me mention that you do not need any Hadoop distribution on your local machine to submit jobs remotely (at least with Hadoop 2.6.0, which seems to apply in your case judging by the release information and the date you posted the question). I will explain how to run the job from Eclipse.
Let me start with the configuration. There are a few resources that shed some light on how this can be achieved, but none of the solutions they propose worked for me without additional configuration.
On the server
Assuming that you have Hadoop, YARN, and HDFS installed, your first step should be to configure the system variables properly (you will need them later, of course). I propose editing the file hadoop-env.sh (in my case located in /etc/hadoop/conf/) and including the following lines:
export HADOOP_CONF_DIR=/etc/hadoop/conf/
export HADOOP_COMMON_HOME=/usr/lib/hadoop/
export HADOOP_HDFS_HOME=/usr/lib/hadoop-hdfs/
export HADOOP_YARN_HOME=/usr/lib/hadoop-yarn/
export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce/
(where /usr/lib/hadoop/ corresponds to the directory where Hadoop was installed). Restart the services.
In core-site.xml you should have fs.defaultFS configured; note its value down somewhere and check that the firewall leaves the corresponding port open, so that an external client can perform data-related operations (see the connectivity sketch after the snippet below). If you do not have this configuration, add the following entry to the file:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://<host-name></value>
    <final>true</final>
</property>
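A minimal connectivity check of my own (not part of the server setup): run this from the client machine, replacing <host-name> with the value noted above. If it prints the root directory listing, the NameNode port is reachable by external clients.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://<host-name>/"); // value from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus status : fs.listStatus(new Path("/"))) { // list the HDFS root
            System.out.println(status.getPath());
        }
    }
}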
This assumes that you have properly configured your namenode(s) and datanode(s). Edit the yarn-site.xml file and add the following entries (or check whether they are present and note down the values):
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value><your-hostname>:8050</value>
</property>
<property>
    <name>yarn.application.classpath</name>
    <value>
        $HADOOP_CONF_DIR,
        $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
        $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
        $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
        $HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*
    </value>
</property>
(check the Hadoop documentation to understand what the different configuration options mean)
Modify the mapred-site.xml file with the following entries:
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/user</value>
</property>
Restart the services. The server is now mostly ready to go. Check that all required ports are accessible from the outside (there is a fairly complete listing of Hadoop ports online; only some of them need to be open, so check with your sysadmins).
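As a quick sanity check of my own (not part of the original setup), you can probe the relevant ports from the client machine before submitting anything. The port numbers below are assumptions: 8020 is the common NameNode RPC default and 8050 is the ResourceManager port configured above; adapt them to your cluster.
import java.net.InetSocketAddress;
import java.net.Socket;

public class PortProbe {
    public static void main(String[] args) {
        String host = "<your-hostname>";  // same placeholder as in the XML above
        int[] ports = {8020, 8050};       // NameNode RPC (assumed default), ResourceManager (see yarn-site.xml)
        for (int port : ports) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 3000); // 3 second timeout
                System.out.println(host + ":" + port + " is reachable");
            } catch (Exception e) {
                System.out.println(host + ":" + port + " is NOT reachable: " + e);
            }
        }
    }
}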
On the Client
Create a project in Eclipse (a simple Java application). Create your Mapper and Reducer (there are many tutorials, so I will not give any examples here). Now, in the main class, provide the following configuration for your job (it might differ depending on your security and system constraints, so you will probably have to dig into this yourself if you are unable to connect to the server machine remotely):
Configuration conf = new Configuration();
conf.set("yarn.resourcemanager.address", "<your-hostname>:8050"); // see step 3
conf.set("mapreduce.framework.name", "yarn");
conf.set("fs.defaultFS", "hdfs://<your-hostname>/"); // see step 2
conf.set("yarn.application.classpath",
        "$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,"
        + "$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,"
        + "$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,"
        + "$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*");

Job job = Job.getInstance(conf);
if (args.length > 0) {
    job.setJar(args[0]); // see below; use this when submitting from Eclipse
} else {
    job.setJarByClass(HadoopWorkloadMain.class); // use this when the jar has been uploaded to the server and the job runs directly and locally on the server
}

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);

job.setMapperClass(SomeMapper.class);
job.setCombinerClass(SomeReducer.class);
job.setReducerClass(SomeReducer.class);

FileInputFormat.addInputPath(job, new Path("/inputs/"));    // existing HDFS directory
FileOutputFormat.setOutputPath(job, new Path("/results/")); // HDFS directory that must not exist yet

job.waitForCompletion(true);
The classpath configuration must be set according to this resource.
This should do the trick. Run your main class and watch Hadoop at work. In any case, I wish you luck and patience; a task that sounds easy can take quite significant effort.
Troubleshooting:
Upvotes: 6
Reputation: 212
To achieve this, you need to have locally the same copy of the Hadoop distribution and configuration files (core-site.xml, hdfs-site.xml and mapred-site.xml) that are present on the namenode.
Then you can submit jobs to the remote cluster from your machine using the hadoop command.
Upvotes: 1