Reputation: 55
I'm trying to make a modified version of the classic word count program, where the target output is the name of an input document and the number of unique words it contains.
To achieve this I planned to use a custom datatype as the key, where the datatype contains the name of an input file and a word, i.e. DataStruct = [filename, word].
My plan is to do this in two passes: in the first, I map the input files to (DataStruct, 1) key-value pairs and then reduce these to (DataStruct, count). I envision every output line being formatted like this:
..
file1 word 4
file2 word 6
..
I will then do a second pass, where the map phase produces (filename, 1) pairs and the reducer produces the desired (filename, count) output.
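As a quick sanity check outside Hadoop, the logic of that second pass can be sketched in plain Java that consumes lines in the `filename word count` format shown above. This is only an illustration of the counting step (the class and method names here are made up, not from the actual job code); since the first pass emits one line per unique (file, word) pair, the count column can be ignored:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UniqueWordCount {
    // Count unique words per file from first-pass output lines of the
    // form "filename word count". Each line already represents one
    // unique (file, word) pair, so we simply add 1 per line.
    public static Map<String, Integer> uniqueWords(List<String> lines) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String line : lines) {
            String filename = line.split("\\s+")[0];
            result.merge(filename, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "file1 apple 4",
                "file1 banana 2",
                "file2 apple 6");
        System.out.println(uniqueWords(lines)); // {file1=2, file2=1}
    }
}
```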
public static class DataStruct implements WritableComparable<DataStruct> {
    // Needs these imports at the top of the file:
    //   import java.io.DataInput;
    //   import java.io.DataOutput;
    //   import java.io.IOException;
    //   import org.apache.hadoop.io.Text;
    //   import org.apache.hadoop.io.WritableComparable;
    private Text word;
    private Text filename;

    public DataStruct(Text w, Text fn) {
        // Copy the values rather than aliasing the arguments:
        // Hadoop reuses Writable instances between records.
        word = new Text(w);
        filename = new Text(fn);
    }

    public DataStruct() {
        word = new Text();
        filename = new Text();
    }

    public void set(Text w, Text fn) {
        word.set(w);
        filename.set(fn);
    }

    public Text getFilename() {
        return filename;
    }

    public Text getWord() {
        return word;
    }

    @Override
    public int compareTo(DataStruct d) {
        int cmp = word.compareTo(d.word);
        if (cmp == 0) {
            return filename.compareTo(d.filename);
        }
        return cmp;
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof DataStruct) {
            DataStruct other = (DataStruct) o;
            return word.equals(other.word) && filename.equals(other.filename);
        }
        return false;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word.readFields(in);
        filename.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        word.write(out);
        filename.write(out);
    }

    @Override
    public int hashCode() {
        String combine = word.toString() + filename.toString();
        return combine.hashCode();
    }
}
My output instead looks like this:
..
UniqueWordsDocument$DataStruct@a3cd2dd1 1
UniqueWordsDocument$DataStruct@1f6943cc 1
..
and I can't find anything online that explains this. I have figured out that the value after the @ is the hash code of the object, but I don't know how to proceed without having the filename and word in the output. If someone can explain what is happening here and/or how to fix this issue, I would be incredibly appreciative.
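For what it's worth, the `ClassName@hex` form is exactly what `Object.toString()` produces when a class does not override `toString()`: the class name, an `@`, and the hash code in hexadecimal. A minimal standalone demonstration (a hypothetical `Pair` class, not the actual key class):

```java
public class ToStringDemo {
    // Hypothetical stand-in for a composite key class; NOT Hadoop code.
    static class Pair {
        final String word;
        final String filename;

        Pair(String w, String f) {
            word = w;
            filename = f;
        }

        @Override
        public int hashCode() {
            return (word + filename).hashCode();
        }
        // No toString() override here, so printing falls back to
        // Object.toString(), which returns
        // getClass().getName() + "@" + Integer.toHexString(hashCode()).
    }

    public static void main(String[] args) {
        Pair p = new Pair("word", "file1");
        // Prints something like ToStringDemo$Pair@<hex of hashCode()>
        System.out.println(p);
    }
}
```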
Thanks for your help.
Upvotes: 1
Views: 82
Reputation: 5531
You need to override the public String toString() method in your DataStruct class.
As things stand, Java has no idea how to display your DataStruct objects, so it falls back to Object.toString() and just prints a reference to them (the class name followed by the hash code in hex). Hadoop's TextOutputFormat writes keys and values by calling toString() on them, which is why your override is what ends up in the output file.
You may want something like:
@Override
public String toString() {
    return word.toString() + "-" + filename.toString();
}
Upvotes: 1