Yuhao

Reputation: 1610

How to serialize a Java object in Hadoop?

An object should implement the Writable interface in order to be serialized when transmitted in Hadoop. Take the Lucene ScoreDoc class as an example:

public class ScoreDoc implements java.io.Serializable {

  /** The score of this document for the query. */
  public float score;

  /** Expert: A hit document's number.
   * @see Searcher#doc(int) */
  public int doc;

  /** Only set by {@link TopDocs#merge} */
  public int shardIndex;

  /** Constructs a ScoreDoc. */
  public ScoreDoc(int doc, float score) {
    this(doc, score, -1);
  }

  /** Constructs a ScoreDoc. */
  public ScoreDoc(int doc, float score, int shardIndex) {
    this.doc = doc;
    this.score = score;
    this.shardIndex = shardIndex;
  }

  // A convenience method for debugging.
  @Override
  public String toString() {
    return "doc=" + doc + " score=" + score + " shardIndex=" + shardIndex;
  }
}

How should I serialize it with the Writable interface? What is the connection between the Writable and java.io.Serializable interfaces?

Upvotes: 0

Views: 4296

Answers (2)

Tejas Patil

Reputation: 6169

I think it won't be a good idea to tamper with the built-in Lucene class. Instead, have your own class which will contain a field of ScoreDoc type and will implement Hadoop's Writable interface. It would be something like this:

public class MyScoreDoc implements Writable  {

  private ScoreDoc sd;

  public void write(DataOutput out) throws IOException {
      String [] splits = sd.toString().split(" ");

      // toString() prints "doc=... score=... shardIndex=...",
      // so parse each value back out of the string
      int doc = Integer.parseInt((splits[0].split("="))[1]);
      float score = Float.parseFloat((splits[1].split("="))[1]);
      int shardIndex = Integer.parseInt((splits[2].split("="))[1]);

      out.writeFloat(score);
      out.writeInt(doc);
      out.writeInt(shardIndex);
  }

  public void readFields(DataInput in) throws IOException {
      float score = in.readFloat();
      int doc = in.readInt();
      int shardIndex = in.readInt();

      // ScoreDoc's constructor takes (doc, score, shardIndex)
      sd = new ScoreDoc(doc, score, shardIndex);
  }

  //String toString()
}
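
Since ScoreDoc exposes score, doc and shardIndex as public fields (see the class in the question), the toString() parsing can also be skipped and the fields read directly. A minimal, self-contained sketch of the same wrapper idea, assuming Lucene's ScoreDoc (as posted in the question) and Hadoop's Writable are on the classpath:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.lucene.search.ScoreDoc;

public class MyScoreDoc implements Writable {

  private ScoreDoc sd;

  // no-arg constructor so Hadoop can instantiate the Writable reflectively
  public MyScoreDoc() {}

  public MyScoreDoc(ScoreDoc sd) {
    this.sd = sd;
  }

  public void write(DataOutput out) throws IOException {
    // read the public fields directly instead of parsing toString()
    out.writeFloat(sd.score);
    out.writeInt(sd.doc);
    out.writeInt(sd.shardIndex);
  }

  public void readFields(DataInput in) throws IOException {
    float score = in.readFloat();
    int doc = in.readInt();
    int shardIndex = in.readInt();
    sd = new ScoreDoc(doc, score, shardIndex);
  }
}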

Upvotes: 1

tgkprog

Reputation: 4598

First, see Hadoop: Easy way to have object as output value without Writable interface; you can use Java serialization. That usually comes down to registering Hadoop's JavaSerialization in the job Configuration, as in the sketch below.
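
A sketch of that configuration change, assuming the org.apache.hadoop.conf.Configuration API; the helper class and method name here are only for illustration:

import org.apache.hadoop.conf.Configuration;

public class JavaSerializationConfig {

  // build a Configuration that accepts plain java.io.Serializable types
  // (such as the posted ScoreDoc) in addition to Writables
  public static Configuration withJavaSerialization() {
    Configuration conf = new Configuration();
    conf.set("io.serializations",
        "org.apache.hadoop.io.serializer.WritableSerialization,"
      + "org.apache.hadoop.io.serializer.JavaSerialization");
    return conf;
  }
}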

OR see http://developer.yahoo.com/hadoop/tutorial/module5.html: you need to write your own write and read functions. It is quite simple, since inside them you can call the API to read and write int, float, String, etc.

Your example with Writable (you need to import it, as shown below):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class ScoreDoc implements java.io.Serializable, Writable  {
    /** The score of this document for the query. */
    public float score;//... other fields and constructors as in the question

  public void write(DataOutput out) throws IOException {
      out.writeFloat(score);
      out.writeInt(doc);
      out.writeInt(shardIndex);
  }

  public void readFields(DataInput in) throws IOException {
      score = in.readFloat();
      doc = in.readInt();
      shardIndex = in.readInt();
  }

  //rest: toString() etc.
}

Note: the order of writes in write() and reads in readFields() must be the same, otherwise the value of one field will end up in another, and if the types differ you will get serialization errors on read.
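
To convince yourself that the order lines up, a quick round trip through in-memory streams can be used. A sketch, assuming the full ScoreDoc from the question with the write/readFields above; the class name RoundTripCheck and the placeholder constructor arguments are only for illustration:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTripCheck {

  public static void main(String[] args) throws IOException {
    ScoreDoc original = new ScoreDoc(7, 0.83f, 2);

    // serialize with the Writable write() above
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    original.write(new DataOutputStream(bytes));

    // deserialize into a fresh instance with readFields();
    // the reads mirror the writes (float, int, int), so the fields line up
    ScoreDoc copy = new ScoreDoc(0, 0f, -1);
    copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

    System.out.println(copy);  // doc=7 score=0.83 shardIndex=2
  }
}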

Upvotes: 0
