dvk

Reputation: 111

Hadoop custom split of TextFile

I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. How can I generate key/value pairs in MapReduce where each value is an entire logical line?

Upvotes: 2

Views: 3143

Answers (3)

Thomas Jungblut

Reputation: 20969

I can't find the question that was asked earlier, but you just have to iterate over your lines in a simple MapReduce job and append them to a StringBuilder. Flush the StringBuilder to the context whenever a new record begins. The trick is to declare the StringBuilder as a field of your mapper class rather than as a local variable, so that it persists across map() calls.

Here it is: Processing paragraphs in text files as single records with Hadoop
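A minimal sketch of that idea, written as plain Java so it runs without the Hadoop jars. The class name LogicalLineAccumulator and the emitted list are hypothetical stand-ins: in a real job the same state would sit in a Mapper subclass, map() would receive each physical line as a Text value, and flush() would call context.write(...). It only treats a trailing backslash as a continuation, so bracketed or triple-quoted multi-line constructs are not covered, and a logical line that straddles a split boundary would still be cut in two.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model of the mapper logic (hypothetical class, not a Hadoop API). */
class LogicalLineAccumulator {
    // A field, not a local variable: it must survive across successive map() calls.
    private final StringBuilder pending = new StringBuilder();
    // Stands in for context.write(...) in a real mapper.
    private final List<String> emitted = new ArrayList<>();

    /** Called once per physical line, as map() would be. */
    void map(String physicalLine) {
        if (physicalLine.endsWith("\\")) {
            // Continuation: drop the trailing backslash and keep accumulating.
            pending.append(physicalLine, 0, physicalLine.length() - 1);
        } else {
            // End of the logical line: append the last piece and emit the record.
            pending.append(physicalLine);
            flush();
        }
    }

    /** Called once at the end, as cleanup() would be, to emit an unterminated record. */
    void cleanup() {
        if (pending.length() > 0) {
            flush();
        }
    }

    private void flush() {
        emitted.add(pending.toString());
        pending.setLength(0);
    }

    List<String> emitted() {
        return emitted;
    }
}
```

Feeding it the two physical lines from the question yields a single record with the backslash and line break removed.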

Upvotes: 4

David Medinets

Reputation: 5618

Preprocess the input file to remove the newlines that fall inside a logical line, so that each logical line occupies exactly one physical line. What is your goal in creating the SequenceFile?
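One way the preprocessing could look, as a rough sketch: the ContinuationJoiner class is a hypothetical helper, and it assumes that backslash continuations are the only multi-line construct in the file. It deletes every backslash that is immediately followed by a line break, so the default TextInputFormat then sees one logical line per physical line.

```java
import java.util.regex.Pattern;

/** Hypothetical helper that joins backslash-continued lines in a block of text. */
class ContinuationJoiner {
    // A backslash immediately followed by a line break (\n or \r\n).
    private static final Pattern CONTINUATION = Pattern.compile("\\\\\\r?\\n");

    /** Returns the text with every backslash-plus-line-break pair removed. */
    static String join(String text) {
        return CONTINUATION.matcher(text).replaceAll("");
    }
}
```

This version works on a String held in memory; for a file too large for that, the same rule would have to be applied while streaming through it line by line.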

Upvotes: 0

Niels Basjes

Reputation: 10652

You should create your own variation on TextInputFormat. In there you make a new RecordReader that skips lines until it sees the start of a logical line.
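To illustrate the skipping logic, here is a plain-Java model rather than a real Hadoop RecordReader. LogicalLineReader is a hypothetical class, splits are expressed as line indices into an in-memory list instead of byte offsets, and only backslash continuations are recognised. A real implementation would extend RecordReader, work on the byte range of a FileSplit, and would need some way of inspecting the line before the split start.

```java
import java.util.List;

/** Plain-Java model of the RecordReader logic (hypothetical, not a Hadoop API). */
class LogicalLineReader {
    private final List<String> lines;
    private final int end;      // exclusive line index where this split ends
    private int pos;            // next physical line to read
    private String current;     // most recently assembled logical line

    LogicalLineReader(List<String> lines, int splitStart, int splitEnd) {
        this.lines = lines;
        this.end = splitEnd;
        this.pos = splitStart;
        // Skip physical lines that continue a logical line begun in the previous split;
        // the reader for that split is responsible for them.
        while (pos > 0 && pos < lines.size() && lines.get(pos - 1).endsWith("\\")) {
            pos++;
        }
    }

    /** Advances to the next logical line; returns false once the split is exhausted. */
    boolean nextKeyValue() {
        if (pos >= end || pos >= lines.size()) {
            return false;
        }
        StringBuilder sb = new StringBuilder();
        // Read past the split end if necessary to finish a record that started inside it.
        while (pos < lines.size()) {
            String line = lines.get(pos++);
            if (line.endsWith("\\")) {
                sb.append(line, 0, line.length() - 1);
            } else {
                sb.append(line);
                break;
            }
        }
        current = sb.toString();
        return true;
    }

    String getCurrentValue() {
        return current;
    }
}
```

With two readers over adjacent splits, the first finishes a logical line that runs past its boundary and the second skips the leftover continuation, so no record is emitted twice or cut in half.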

Upvotes: 1
