Sandeep

Reputation: 349

DataJoins in Hadoop MapReduce

I am trying to implement a use case given in the book Hadoop in Action, but I am not able to compile the code. I am new to Java, so I am not able to understand the exact reasons behind the errors.

The interesting thing is that another piece of code using the same classes and methods compiles successfully.

hadoop@hadoopnode1:~/hadoop-0.20.2/playground/src$ javac -classpath /home/hadoop/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/hadoop/hadoop-0.20.2/lib/commons-cli-1.2.jar:/home/hadoop/hadoop-0.20.2/contrib/datajoin/hadoop-0.20.2-datajoin.jar -d ../classes DataJoin2.java 
DataJoin2.java:49: cannot find symbol
symbol  : constructor TaggedWritable(org.apache.hadoop.io.Text)
location: class DataJoin2.TaggedWritable
            TaggedWritable retv = new TaggedWritable((Text) value);
                                  ^
DataJoin2.java:69: cannot find symbol
symbol  : constructor TaggedWritable(org.apache.hadoop.io.Text)
location: class DataJoin2.TaggedWritable
            TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
                                  ^
DataJoin2.java:113: setMapperClass(java.lang.Class<? extends org.apache.hadoop.mapreduce.Mapper>) in org.apache.hadoop.mapreduce.Job cannot be applied to (java.lang.Class<DataJoin2.MapClass>)
        job.setMapperClass(MapClass.class);
           ^
DataJoin2.java:114: setReducerClass(java.lang.Class<? extends org.apache.hadoop.mapreduce.Reducer>) in org.apache.hadoop.mapreduce.Job cannot be applied to (java.lang.Class<DataJoin2.Reduce>)
        job.setReducerClass(Reduce.class);
           ^
4 errors

----------------code----------------------

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

// DataJoin Classes
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;


public class DataJoin2
{
    public static class MapClass extends DataJoinMapperBase
    {
        protected Text generateInputTag(String inputFile)
        {
            String datasource = inputFile.split("-")[0];
            return new Text(datasource);            
        }

        protected Text generateGroupKey(TaggedMapOutput aRecord)
        {
            String line = ((Text) aRecord.getData()).toString();
            String[] tokens = line.split(",");
            String groupKey = tokens[0];
            return new Text(groupKey);
        }

        protected TaggedMapOutput generateTaggedMapOutput(Object value)
        {
            TaggedWritable retv = new TaggedWritable((Text) value);
            retv.setTag(this.inputTag);
            return retv;
        }
    } // End of class MapClass

    public static class Reduce extends DataJoinReducerBase
    {
        protected TaggedMapOutput combine(Object[] tags, Object[] values)
        {
            if (tags.length < 2) return null;
            String joinedStr = "";
            for (int i=0;i<values.length;i++)
            {
                if (i>0) joinedStr += ",";
                TaggedWritable tw = (TaggedWritable) values[i];
                String line = ((Text) tw.getData()).toString();
                String[] tokens = line.split(",",2);
                joinedStr += tokens[1];
            }
            TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
            retv.setTag((Text) tags[0]);
            return retv;
        }
    } // End of class Reduce

    public static class TaggedWritable extends TaggedMapOutput 
    {
        private Writable data;

        public TaggedWritable()
        {
            this.tag = new Text("");
            this.data = data;
        }

        public Writable getData()
        {
            return data;
        }

        public void write(DataOutput out) throws IOException
        {
            this.tag.write(out);
            this.data.write(out);
        }

        public void readFields(DataInput in) throws IOException
        {
            this.tag.readFields(in);
            this.data.readFields(in);
        }       
    } // End of class TaggedWritable

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
          System.err.println("Usage: DataJoin2 <in> <out>");
          System.exit(2);
        }
        Job job = new Job(conf, "DataJoin");
        job.setJarByClass(DataJoin2.class);     
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);               
    }
}

Upvotes: 1

Views: 1594

Answers (3)

Colonna Maurizio

Reputation: 97

I have hadoop-2.7.1; what worked for me was adding the dependency from Maven in the pom.xml:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-datajoin</artifactId>
    <version>2.7.1</version>
</dependency>

This is the URL for hadoop-datajoin: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-datajoin

Upvotes: 0

jason

Reputation: 241601

For your first two error messages, the compiler errors are clearly telling you that you don't have a constructor for TaggedWritable that accepts an argument of type Text. It appears to me that you are making TaggedWritable serve as a wrapper for Writable to add a tag, so may I suggest adding a constructor:

public TaggedWritable(Writable data) {
    this.tag = new Text("");
    this.data = data;
}

In fact, as you've written it, this line

this.data = data;

just reassigns data to itself, so I'm pretty sure you intended to have a constructor argument named data. See my reasoning above for why I think you should make it Writable instead of Text. Since Text implements Writable, this will resolve your first two error messages.

However, you will need to keep a default no-arg constructor. This is because Hadoop uses reflection to instantiate Writable values as it serializes them across the network between the map and reduce phases. I think you have a tiny bit of a mess here in the default no-arg constructor:

public TaggedWritable() {
    this.tag = new Text("");
}

The reason I see this as a mess is that if you don't assign a valid instance of whatever your wrapped Writable values are to TaggedWritable.data, you will get a NullPointerException when this.data.readFields(in) is invoked in TaggedWritable.readFields(DataInput). Since it's a general wrapper, you should probably make TaggedWritable a generic type and then use reflection to assign to TaggedWritable.data; a sketch of one way to do that follows.
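
Here is a minimal sketch of one variation on that idea. Rather than making the class generic, it writes the concrete class name alongside the data in write(), so that readFields() can recreate the wrapped instance reflectively; it assumes an extra import of org.apache.hadoop.util.ReflectionUtils alongside the existing ones. Treat it as a starting point, not a drop-in replacement:

public static class TaggedWritable extends TaggedMapOutput {
    private Writable data;

    // Hadoop needs the no-arg constructor for reflective instantiation;
    // data stays null here and is recreated in readFields().
    public TaggedWritable() {
        this.tag = new Text("");
    }

    public TaggedWritable(Writable data) {
        this.tag = new Text("");
        this.data = data;
    }

    public Writable getData() {
        return data;
    }

    public void write(DataOutput out) throws IOException {
        this.tag.write(out);
        // Record the concrete class so readFields() can recreate it.
        out.writeUTF(this.data.getClass().getName());
        this.data.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        this.tag.readFields(in);
        String dataClass = in.readUTF();
        if (this.data == null) {
            try {
                // Instantiate the wrapped Writable reflectively.
                this.data = (Writable) ReflectionUtils.newInstance(
                        Class.forName(dataClass), null);
            } catch (ClassNotFoundException e) {
                throw new IOException(e);
            }
        }
        this.data.readFields(in);
    }
}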

For your last two compiler errors, note that to use hadoop-datajoin you need to be using the old API classes. Thus, all of these

org.apache.hadoop.mapreduce.Job;
org.apache.hadoop.mapreduce.Mapper;
org.apache.hadoop.mapreduce.Reducer;
org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

should be replaced by their old API equivalents: org.apache.hadoop.mapred.JobConf instead of org.apache.hadoop.mapreduce.Job, and so on. That will handle your last two error messages.
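
As a rough sketch, the driver against the old API might look something like this. It uses the org.apache.hadoop.mapred method names (setInputFormat rather than setInputFormatClass, JobClient.runJob rather than Job.waitForCompletion), with fully qualified class names to avoid confusion with the new-API imports; I haven't compiled this against your code, so adjust as needed:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JobConf replaces Job in the old org.apache.hadoop.mapred API
    JobConf job = new JobConf(conf, DataJoin2.class);
    job.setJobName("DataJoin");

    // DataJoinMapperBase and DataJoinReducerBase already implement the
    // old-API Mapper and Reducer interfaces, so these calls now match.
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);

    // Old-API names: setInputFormat, not setInputFormatClass
    job.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);
    job.setOutputFormat(org.apache.hadoop.mapred.TextOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(TaggedWritable.class);

    // Old-API org.apache.hadoop.mapred.FileInputFormat/FileOutputFormat
    org.apache.hadoop.mapred.FileInputFormat.setInputPaths(job, new Path(args[0]));
    org.apache.hadoop.mapred.FileOutputFormat.setOutputPath(job, new Path(args[1]));

    JobClient.runJob(job);
}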

Upvotes: 1

Judge Mental

Reputation: 5239

There is nothing ambiguous about the error message. It is telling you that you did not provide a constructor for TaggedWritable that takes an argument of type Text; the code you posted shows only a no-arg constructor.

Upvotes: 1
