Mark Stroeven

Reputation: 696

How does one make a Hadoop mapper read entire sentences?

I am trying to feed my mapper in a MapReduce project one sentence at a time for some text analysis. This text could look something like:

Nicholas looked at her face with surprise. It was the same face he had projected against the epiphysial cartilage. This arrangement favours during the whole time of the reading, gazed at his delicate fingers the frontier. convincing unacceptable confrontation swiftly paid joke instant hospitals. The one and the other may serve as a pastime. But what's chief officials.

However, Hadoop's FileInputFormat reads the following:

[image: input]

How do I program Hadoop's input format to read entire sentences delimited by a "."? I tried using a key/value input format, but Hadoop always seems to cut a sentence at a line break.

Upvotes: 0

Views: 589

Answers (3)

PetrosP

Reputation: 665

You can use TextInputFormat and set the textinputformat.record.delimiter property in your configuration.

conf.set("textinputformat.record.delimiter", ".");

//EDIT

The code below gives your desired output, using the property above:

package dotdelimiter;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class App extends Configured implements Tool {
    public static class SimpleMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each incoming value is one record: the text up to the next ".".
            String val = "START\n" + value.toString() + "\nEND";
            context.write(NullWritable.get(), new Text(val));
        }
    }
    
    public static void main(String[] args) {
        int result = 1;
        try {
            result = ToolRunner.run(new App(), args);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(result);
        }
    }
    

    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "Dotdelimiter Job");
        job.setJarByClass(getClass());

        Configuration conf = job.getConfiguration();

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Map-only job: the mapper emits NullWritable keys and Text values.
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(SimpleMapper.class);
        job.setNumReduceTasks(0);

        conf.set("textinputformat.record.delimiter", ".");

        return job.waitForCompletion(true) ? 0 : 1;
    }
}

Output:

START
Nicholas looked at her face with surprise
END
START
It was the same face he had projected against the epiphysial cartilage
END
START
This arrangement favours during the whole time of the reading, gazed at his delicate fingers the frontier
END
START
convincing unacceptable confrontation swiftly paid joke instant hospitals
END
START
The one and the other may serve as a pastime
END
START
But what's chief officials
END

Upvotes: 0

Mahesh Jadhav

Reputation: 14

You can create a custom input format to read sentences delimited by ".".

For this you need to create a RecordReader and a class, let's say MyValue, that implements the WritableComparable interface.

You can then use this class as the value type in your mapper.

I will try to implement this at my end and will update this post in the coming days. Read about custom input formats; you might arrive at the solution on your own. A rough sketch follows below.
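A minimal sketch of this idea, assuming the new mapreduce API; the class name SentenceInputFormat is illustrative, and the built-in Text type stands in for a custom WritableComparable value class. It simply reuses Hadoop's LineRecordReader with "." as the record delimiter:

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Illustrative custom input format: every record is the text up to the next ".".
public class SentenceInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Reuse the built-in LineRecordReader, but hand it "." as the
        // record delimiter instead of the default newline.
        return new LineRecordReader(".".getBytes(StandardCharsets.UTF_8));
    }
}

The driver would then call job.setInputFormatClass(SentenceInputFormat.class) instead of using TextInputFormat with the delimiter property.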

Upvotes: 0

bmargulies

Reputation: 100050

You can't make the standard Hadoop input format act as a sentence boundary detector. If you want nontrivial (statistical) sentence breaking, you need a separate map job that does the sentence splitting, and then you will have sentences as units. There are any number of open-source NLP libraries you can integrate for this purpose. If you want something trivial that mistakes abbreviations for sentence endings, you can shove it into an input format.
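For instance, here is a minimal sketch of that separate sentence-splitting map step, using the JDK's rule-based java.text.BreakIterator in place of a statistical NLP library (the class name SentenceSplitMapper is illustrative; it still only sees one input record at a time, so sentences spanning records would need whole-file or larger records):

import java.io.IOException;
import java.text.BreakIterator;
import java.util.Locale;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative map-only step that emits one output record per detected sentence.
public class SentenceSplitMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String text = value.toString();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        // Walk the detected sentence boundaries and emit each sentence.
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                context.write(NullWritable.get(), new Text(sentence));
            }
        }
    }
}

A later job can then treat each line of this job's output as one sentence.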

Upvotes: -1
