Reputation: 111
I have a fairly large text file that I would like to convert into a SequenceFile. Unfortunately, the file consists of Python code with logical lines running over several physical lines. For example,
print "Blah Blah\
... blah blah"
Each logical line is terminated by a NEWLINE. Could someone clarify how I could possibly generate Key, Value pairs in Map-Reduce where each Value is the entire logical line?
Upvotes: 2
Views: 3143
Reputation: 20969
I don't find the question asked earlier, but you just have to iterate over your lines via a simple mapreduce job and save them into a StringBuilder. Flush the StringBuilder to the context if you want to begin with a new record. The trick is to setup the StringBuilder in your mappers class as a field and not as a local variable.
here it is: Processing paraphragraphs in text files as single records with Hadoop
Upvotes: 4
Reputation: 5618
Preprocess the input file to remove the newlines. What is your goal in creating the SequenceFile?
Upvotes: 0
Reputation: 10652
You should create your own variation on TextInputFormat. In there you make a new RecordReader that skips lines until it sees the start of a logical line.
Upvotes: 1