Hadoop InputSplit for large text-based files

Question

In hadoop I'd like to split a file (almost) equally to each mapper. The file is large and I want to use a specific number of mappers at which are defined at job start. Now I've customized the input split but I want to be sure that if I split the file in two (or more splits) I won't cut a line in half as I want each mapper to have complete lines and not broken ones.

So the question is this, how can I get the approximate size of a filesplit during each creation or if that is not possible how I can estimate the number of (almost) equal filesplits for a large file given the constraint that I don't want to have any broken lines in any mapper instance.

Hadoop InputSplit for large text-based files

Answers (1)

Related Questions