Yuhao

Reputation: 1610

How to generate more than one key-value pair for one input line in a Hadoop InputFormat?

Here is the background. I have the following input for my MapReduce job (example):

Apache Hadoop
Apache Lucene
StackOverflow
....

(Each line actually represents a user query, but that is not important here.) I want my RecordReader class to read one line and then pass several key-value pairs to the mappers. For example, if the RecordReader gets Apache Hadoop, I want it to generate the following key-value pairs and pass them to the mappers:

Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3

("-" is the separator here.) I found that RecordReader passes key-value pairs in its next() method:

next(key, value);

Every time RecordReader.next() is called, only one key and one value are passed as arguments. So how can I produce several pairs from a single line?

Upvotes: 2

Views: 1675

Answers (3)

Winston

Reputation: 1212

I think that if you want the mapper to receive the same key several times, you must implement your own RecordReader. For example, you can write a MultiRecordReader based on LineRecordReader and change its nextKeyValue method. This is the original code from LineRecordReader:

public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new LongWritable();
    }
    key.set(pos);
    if (value == null) {
      value = new Text();
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end) {
      newSize = in.readLine(value, maxLineLength,
          Math.max(maxBytesToConsume(pos), maxLineLength));
      pos += newSize;
      if (newSize < maxLineLength) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " + 
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      return true;
    }
  }

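You can change it like this, so that the key holds the line text, the value holds the copy number, and a new counter n tracks how many copies of the current line have been emitted:

// Assumes this method lives in a copy of LineRecordReader whose fields are
// declared as: Text key; IntWritable value; int n = 0;
// getCurrentKey()/getCurrentValue() must return these types as well.
public boolean nextKeyValue() throws IOException {
    if (key == null) {
      key = new Text();
    }
    if (value == null) {
      value = new IntWritable();
    }
    // Copies of the current line remain: emit the next one without reading.
    if (n > 0 && n < 3) {
      value.set(++n);
      return true;
    }
    int newSize = 0;
    // We always read one extra line, which lies outside the upper
    // split limit i.e. (end - 1)
    while (getFilePosition() <= end) {
      newSize = in.readLine(key, maxLineLength,
          Math.max(maxBytesToConsume(pos), maxLineLength)); // read into key, not value
      pos += newSize;
      if (newSize < maxLineLength) {
        break;
      }

      // line too long. try again
      LOG.info("Skipped line of size " + newSize + " at pos " +
               (pos - newSize));
    }
    if (newSize == 0) {
      key = null;
      value = null;
      return false;
    } else {
      n = 1; // first copy of the freshly read line
      value.set(n);
      return true;
    }
  }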

I think this should suit your needs.
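The control flow is easier to check without the Hadoop plumbing. Below is a minimal, Hadoop-free sketch of the same idea: a reader that hands out each line a fixed number of times before advancing. The class name RepeatingLineReader, the copies parameter, and the drain helper are hypothetical names chosen for illustration; an Iterator<String> stands in for the real line reader.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hadoop-free model of a reader that emits each input line several times. */
class RepeatingLineReader {
    private final Iterator<String> lines; // stands in for the underlying line reader
    private final int copies;             // pairs to emit per line (3 in the question)
    private String key;                   // current line, null before the first read and at end of input
    private int value;                    // 1-based copy number of the current pair

    RepeatingLineReader(Iterator<String> lines, int copies) {
        if (copies < 1) {
            throw new IllegalArgumentException("copies must be >= 1");
        }
        this.lines = lines;
        this.copies = copies;
    }

    /** Advances to the next pair; returns false once the input is exhausted. */
    boolean nextKeyValue() {
        // Copies of the current line remain: bump the counter, do not read.
        if (key != null && value < copies) {
            value++;
            return true;
        }
        // Otherwise move on to the next line, if there is one.
        if (!lines.hasNext()) {
            key = null;
            return false;
        }
        key = lines.next();
        value = 1;
        return true;
    }

    String getCurrentKey() { return key; }

    int getCurrentValue() { return value; }

    /** Drains a reader into "key - value" strings, the way a map loop would consume it. */
    static List<String> drain(RepeatingLineReader reader) {
        List<String> out = new ArrayList<>();
        while (reader.nextKeyValue()) {
            out.add(reader.getCurrentKey() + " - " + reader.getCurrentValue());
        }
        return out;
    }
}
```

The point of the sketch is that the reader keeps its own state between calls: each call still returns exactly one pair, and a new line is read only after all copies of the previous one have been handed out.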

Upvotes: 1

aruns

Reputation: 418

Try writing the records from the mapper without a key, putting the whole pair in the value:

context.write(NullWritable.get(), new Text("Apache Hadoop - 1"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 2"));
context.write(NullWritable.get(), new Text("Apache Hadoop - 3"));

Upvotes: 0

aa8y

Reputation: 3942

I believe you can simply use this:

public static class MultiMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Number of pairs to emit for each input line.
    private static final int n = 3;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // Emit the line itself as the key, paired with 1..n.
        for (int i = 1; i <= n; i++) {
            context.write(value, new IntWritable(i));
        }
    }
}

Here n is the number of values you want to emit for each input line. For example, for the key-value pairs you specified:

Apache Hadoop - 1
Apache Hadoop - 2
Apache Hadoop - 3

n would be 3.
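To see the fan-out in isolation, here is a small Hadoop-free sketch of the same loop. The class FanOutSketch and the method fanOut are hypothetical names used only for illustration; a list of map entries stands in for the calls to context.write.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class FanOutSketch {
    /** Mirrors the loop in map(): one input line becomes n (line, i) pairs. */
    static List<Map.Entry<String, Integer>> fanOut(String line, int n) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int i = 1; i <= n; i++) {
            // Stands in for context.write(value, new IntWritable(i)).
            pairs.add(new SimpleEntry<String, Integer>(line, i));
        }
        return pairs;
    }
}
```

Doing the expansion in the mapper keeps the default input format untouched, which is usually less code than a custom RecordReader.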

Upvotes: 2
