Don Albrecht

Reputation: 1302

Hadoop Streaming Job with no input file

Is it possible to execute a Hadoop Streaming job that has no input file?

In my use case, I'm able to generate the necessary records for the reducer with a single mapper and execution parameters. Currently, I'm using a stub input file with a single line; I'd like to remove this requirement.

We have two use cases in mind:

  1. I want to distribute the loading of files into HDFS from a network location available to all nodes. Basically, I'm going to run ls in the mapper and send the output to a small set of reducers.
  2. We are going to be running fits over several different parameter ranges against several models. The model names do not change and will go to the reducer as keys, while the list of tests to run is generated in the mapper.
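For the first use case, a mapper along these lines could work — a hedged sketch only: the script name, the SRC_DIR mount point, and the four-way bucketing are assumptions, not from the question:

```shell
# Hypothetical mapper for use case 1: discard the stub input, then
# emit one tab-separated (bucket, path) pair per file found at the
# shared location. SRC_DIR and the bucket count of 4 are assumptions.
cat > ls_mapper.sh <<'EOF'
#!/bin/sh
cat > /dev/null                  # consume and ignore the stub input line
i=0
for f in "${SRC_DIR:-/mnt/shared}"/*; do
    # The key (i mod 4) spreads paths across a small set of reducers.
    printf '%d\t%s\n' "$((i % 4))" "$f"
    i=$((i + 1))
done
EOF
chmod +x ls_mapper.sh
```

You can smoke-test it locally with `echo stub | SRC_DIR=/some/dir ./ls_mapper.sh` before passing it to -mapper.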

Upvotes: 3

Views: 1182

Answers (2)

user2314737

Reputation: 29317

No, it is not possible to execute a Hadoop Streaming job that has no input file.

The only two options required by mapred streaming are -input and -output.

From the Hadoop Streaming documentation:

mapred streaming [genericOptions] [streamingOptions]

where the streaming options are one or more of:

  • -input <directoryname> or <filename> Required (Input location for mapper)
  • -output <directoryname> Required (Output location for reducer)
  • -mapper <executable> or <JavaClassName> Optional (Mapper executable. If not specified, IdentityMapper is used as the default)
  • -reducer <executable> or <JavaClassName> Optional (Reducer executable. If not specified, IdentityReducer is used as the default)
  • [ . . . ] all other options are optional

So this is a minimal allowed MapReduce streaming job:

mapred streaming \
    -input my_input \
    -output my_output 

This job will just echo the contents of my_input into my_output, where each line is converted into a <key>, <value> pair separated by a tab.
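Without a cluster at hand, the behavior of such an identity job can be approximated locally with a plain shell pipeline — a common way to smoke-test streaming jobs, with `cat` standing in for the identity mapper and reducer and `sort` for the shuffle:

```shell
# Local approximation of map -> shuffle/sort -> reduce for the
# minimal job above; `cat` plays the identity mapper and reducer.
printf 'line two\nline one\n' > my_input
cat my_input | cat | sort | cat > my_output
cat my_output   # prints "line one" then "line two"
```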

Upvotes: 0

carpenter

Reputation: 1210

According to the docs this is not possible. The following are required parameters for execution:

  • input directoryname or filename
  • output directoryname
  • mapper executable or JavaClassName
  • reducer executable or JavaClassName

It looks like providing a dummy input file is the way to go currently.
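Concretely, the dummy-input workaround might be set up like this. All paths here are placeholders, the mapper and reducer scripts are hypothetical, and the job itself assumes a running cluster:

```shell
# Sketch of the dummy-input workaround: put a one-line throwaway
# file into HDFS and point -input at it. Paths and the
# generate_records.sh / process_records.sh scripts are placeholders.
echo "go" > stub.txt
hdfs dfs -mkdir -p /tmp/stub
hdfs dfs -put -f stub.txt /tmp/stub/stub.txt

mapred streaming \
    -input /tmp/stub/stub.txt \
    -output /tmp/job_output \
    -mapper generate_records.sh \
    -reducer process_records.sh \
    -file generate_records.sh \
    -file process_records.sh
```

The mapper can simply read and discard its single stub line, then emit whatever records it generates from the execution parameters.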

Upvotes: 1
