Hadoop MapReduce: TextInputFormat: Meaning of position

Question

I am trying to understand the documentation for TextInputFormat, which says: "An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text."

What does "position" mean? Does it mean the line number in the file?

Given data in a file

  dobbs   2007      20      18     15
  dobbs   2008      22      20     12
  doctor  2007  545525  366136  57313
  doctor  2008  668666  446034  72694

Would it produce a map input like this?

  (1,  "dobbs   2007    20  18  15")
  (2,  "dobbs   2008    22  20  12")
  (3,  "doctor  2007    545525  366136  57313")
  (4,  "doctor  2008    668666  446034  72694")

vishnu viswanath · Accepted Answer

In TextInputFormat, the key is the byte offset of the start of each line, measured from the beginning of the file. It is not the line number.

That is, for the first line the key is 0. For the second line the key is the length of the first line in bytes, including its line terminator. For the third line the key is the offset of the second line plus the length of the second line, and so on.

So no, it will not produce the map input you expect. Assuming the fields are separated by a single space and each line ends with a single newline byte, it would instead be something like:

(0,dobbs 2007 20 18 15)
(20,dobbs 2008 22 20 12)
(40,doctor 2007 545525 366136 57313)
(72,doctor 2008 668666 446034 72694)
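
To check the offsets by hand, here is a minimal sketch in plain Python that simulates how the keys are derived. It is not Hadoop code: the function name text_input_records is hypothetical, and the sample assumes single-space separators and a one-byte "\n" terminator on every line. In a real job the mapper receives the key as a LongWritable and the line as a Text value.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs, imitating the keys TextInputFormat hands to a mapper.

    Hypothetical helper for illustration only; it does not use any Hadoop API.
    """
    offset = 0
    # keepends=True keeps the terminator so its bytes count toward the next offset.
    for raw in data.splitlines(keepends=True):
        line = raw.rstrip(b"\r\n")       # the value excludes the line terminator
        yield offset, line.decode("utf-8")
        offset += len(raw)               # advance by line length plus terminator


# Assumed sample: single spaces between fields, "\n" at the end of each line.
sample = (
    b"dobbs 2007 20 18 15\n"
    b"dobbs 2008 22 20 12\n"
    b"doctor 2007 545525 366136 57313\n"
    b"doctor 2008 668666 446034 72694\n"
)

records = list(text_input_records(sample))
for key, value in records:
    print((key, value))
```

Under those assumptions the keys come out as 0, 20, 40 and 72: the first two lines are 19 characters plus a newline, and the third is 31 characters plus a newline. With "\r\n" terminators or multi-byte characters the offsets would be larger, because the key counts bytes, not characters.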
