Reputation: 5260
I am trying to understand the TextInputFormat documentation, which says: "An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text"
What does "position" mean? Does it mean the line number in the file?
Given this data in a file:
dobbs 2007 20 18 15
dobbs 2008 22 20 12
doctor 2007 545525 366136 57313
doctor 2008 668666 446034 72694
Would it produce a map input like this?
(1, "dobbs 2007 20 18 15")
(2, "dobbs 2008 22 20 12")
(3, "doctor 2007 545525 366136 57313")
(4, "doctor 2008 668666 446034 72694")
Upvotes: 0
Views: 132
Reputation: 3854
In TextInputFormat, the key is the byte offset of the start of each line, measured from the beginning of the file. It is not the line number.
For the first line, the offset (key) is 0.
For the second line, the offset is the length of the first line in bytes, including its line terminator.
For the third line, the offset is the offset of the second line plus the length of the second line (again including its terminator), and so on.
No, it will not produce the map input you expect. Assuming the words are separated by single spaces and each line ends with a single-byte newline, it would be something like:
(0,dobbs 2007 20 18 15)
(20,dobbs 2008 22 20 12)
(40,doctor 2007 545525 366136 57313)
(72,doctor 2008 668666 446034 72694)
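To check the offsets by hand, here is a minimal Python sketch that mimics how the keys are derived. It is not Hadoop code: the function name `offset_line_pairs` is made up for illustration, and it assumes the file is ASCII/UTF-8 with a single `\n` at the end of every line. With `\r\n` line endings each line would be one byte longer, so the keys would shift accordingly.

```python
def offset_line_pairs(data: bytes):
    """Yield (byte_offset, line_text) pairs, similar in spirit to the
    (key, value) records TextInputFormat hands to a mapper.
    Hypothetical helper for illustration only, not part of Hadoop."""
    offset = 0
    # keepends=True keeps the terminator so its bytes count toward the offset
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


# Sample data from the question, assuming "\n" line endings
sample = (
    b"dobbs 2007 20 18 15\n"
    b"dobbs 2008 22 20 12\n"
    b"doctor 2007 545525 366136 57313\n"
    b"doctor 2008 668666 446034 72694\n"
)

pairs = list(offset_line_pairs(sample))
for key, value in pairs:
    print(f"({key},{value})")
# Keys for this sample: 0, 20, 40, 72
```

The first two lines are 19 characters plus a newline (20 bytes each), and the third is 31 characters plus a newline (32 bytes), which is where the keys 20, 40 and 72 come from.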
Upvotes: 2