Reputation: 696
I am trying to feed my mapper in a MapReduce project one sentence at a time for some text analysis. The text could look something like this:
Nicholas looked at her face with surprise. It was the same face he had projected against the epiphysial cartilage. This arrangement favours during the whole time of the reading, gazed at his delicate fingers the frontier. convincing unacceptable confrontation swiftly paid joke instant hospitals. The one and the other may serve as a pastime. But what's chief officials.
However, Hadoop's FileInputFormat reads the following:
How do I program Hadoop's InputFormat to read entire sentences delimited by a "."? I tried using a key-value input format, but Hadoop always seems to cut a sentence at a line break.
Upvotes: 0
Views: 589
Reputation: 665
You can use TextInputFormat and set the textinputformat.record.delimiter property in your configuration.
conf.set("textinputformat.record.delimiter", ".");
//EDIT
The below code gives your desired output, using the property above:
package dotdelimiter;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class App extends Configured implements Tool {

    // Wraps each record in START/END markers so record boundaries show up in the output.
    public static class SimpleMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String val = "START\n" + value.toString() + "\nEND";
            context.write(NullWritable.get(), new Text(val));
        }
    }

    public static void main(String[] args) {
        int result = 1;
        try {
            result = ToolRunner.run(new App(), args);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(result);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "Dotdelimiter Job");
        job.setJarByClass(getClass());
        Configuration conf = job.getConfiguration();
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Declared types must match what SimpleMapper emits: (NullWritable, Text).
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setMapperClass(SimpleMapper.class);
        job.setNumReduceTasks(0); // map-only job, no reducers
        // Split records on "." instead of the default line break.
        conf.set("textinputformat.record.delimiter", ".");
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Output:
START
Nicholas looked at her face with surprise
END
START
It was the same face he had projected against the epiphysial cartilage
END
START
This arrangement favours during the whole time of the reading, gazed at his delicate fingers the frontier
END
START
convincing unacceptable confrontation swiftly paid joke instant hospitals
END
START
The one and the other may serve as a pastime
END
START
But what's chief officials
END
Upvotes: 0
Reputation: 14
You can create a custom input format to read sentences delimited by ".".
For this you need to create a RecordReader and a class, say MyValue, that implements the WritableComparable interface.
You can then use this class as the value type in your mapper.
I will try to implement this on my end and will update this post in the coming days. Read about custom input formats; you may find the solution on your own.
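In the meantime, here is a rough sketch of the core loop such a RecordReader would need, written in plain Java so it runs without Hadoop. The class name DelimitedSentenceReader and its methods are hypothetical, not Hadoop API. It also ignores input split boundaries, which a real RecordReader has to handle, and it collapses line breaks inside a record to single spaces, which is my assumption about what the question wants.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Hadoop: reads records terminated by a
// single delimiter character, regardless of where line breaks fall.
final class DelimitedSentenceReader {
    private final Reader in;
    private final char delimiter;

    DelimitedSentenceReader(Reader in, char delimiter) {
        this.in = in;
        this.delimiter = delimiter;
    }

    // Returns the next non-empty record without its delimiter,
    // or null once the input is exhausted.
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == delimiter) {
                String record = normalize(sb);
                if (!record.isEmpty()) {
                    return record;
                }
                sb.setLength(0); // skip empty records such as ".."
            } else {
                sb.append((char) c);
            }
        }
        // Trailing text with no closing delimiter still counts as a record.
        String tail = normalize(sb);
        return tail.isEmpty() ? null : tail;
    }

    // Collapses runs of whitespace (including line breaks) and trims the ends.
    private static String normalize(CharSequence raw) {
        return raw.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        String text = "Nicholas looked at her\nface with surprise. It was the same face.";
        DelimitedSentenceReader reader =
                new DelimitedSentenceReader(new StringReader(text), '.');
        for (String s = reader.next(); s != null; s = reader.next()) {
            System.out.println(s);
        }
    }
}
```

In a real input format this logic would sit inside nextKeyValue(), with the byte offset as the key and the record as the value.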
Upvotes: 0
Reputation: 100050
You can't make the standard Hadoop input format act as a sentence boundary detector. If you want nontrivial (statistical) sentence breaking, you need a separate map job that does the sentence splitting, after which you will have sentences as units. There are any number of open source NLP libraries you can integrate for this purpose. If you want something trivial that mistakes abbreviations for sentence ends, you can shove it into an input format.
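As a small illustration of doing the splitting in code rather than in the input format, the sketch below uses the JDK's java.text.BreakIterator. This is my substitution, not one of the NLP libraries mentioned above: it is rule-based rather than statistical, so treat it as a baseline and do not expect it to get every abbreviation right. The class name SentenceSplitter is hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: splits text into sentences using the JDK's
// rule-based sentence BreakIterator (not a statistical model).
final class SentenceSplitter {

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        // Walk consecutive boundaries; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Nicholas looked at her face with surprise. It was the same face.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

A mapper in the sentence-splitting job could call something like split() on each input record and emit one output record per sentence, so the downstream job sees sentences as units.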
Upvotes: -1