How can I tell the reducer task id/number that my script is running under in Hadoop streaming?

I would like the output of my streaming reducer task to be different for partition number 0 than for the other partitions. How can I tell from within my script what reducer task it is running as?

Upvotes: 0

Answers (2)

Yann

Reputation: 361

As Nonnib said, if you run your job on MR2/YARN, mapreduce_task_id is not set. Use mapred_task_id instead.

The only reference I have is a Vowpal Wabbit script. I also use it in my YARN jobs, and it works well with versions up to Hadoop 2.0.0-cdh4.6.0.
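Because the variable name seems to depend on the Hadoop version, a script can simply check both names. This is a minimal sketch, assuming the task id is exported under one of the two names mentioned in this thread; the helper name get_task_id is hypothetical, and the format of the value may differ between the two variables.

```python
import os

def get_task_id(environ=os.environ):
    """Return the task id from whichever variable is set, or None."""
    # Check the mapreduce_ name first, then the mapred_ name.
    for name in ("mapreduce_task_id", "mapred_task_id"):
        value = environ.get(name)
        if value:
            return value
    # Neither variable is set, e.g. when the script runs outside Hadoop.
    return None
```

Passing a plain dict as environ makes the helper easy to try locally without a cluster.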

Upvotes: 1

Mateo

Reputation: 1604

I just figured out that there are environment variables mapreduce_task_id and mapreduce_task_partition that one can access from within the script. These have different values for different reduce tasks. For example, task 0 has:

mapreduce_task_id=task_1410791469618_0007_r_000000

whereas, task 1 has:

mapreduce_task_id=task_1410791469618_0007_r_000001

Similarly, task 0 has:

mapreduce_task_partition=0

and task 1 has:

mapreduce_task_partition=1

In Python, these can be accessed as follows:

import os
# The value is a string such as '0', or None if the variable is not set.
my_task_partition = os.environ.get('mapreduce_task_partition')
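To make partition 0 behave differently, as the question asks, the reducer can branch on that value. The following is only a sketch: the function names get_partition and reducer_output and the header line are hypothetical, and the fallback assumes the task id ends in the partition number, as in the examples above.

```python
import os

def get_partition(environ=os.environ):
    """Return the reducer partition number as an int, or None if unknown."""
    value = environ.get("mapreduce_task_partition")
    if value is not None and value.isdigit():
        return int(value)
    # Fall back to the trailing digits of the task id, e.g. ..._r_000001 -> 1.
    task_id = environ.get("mapreduce_task_id", "")
    suffix = task_id.rsplit("_", 1)[-1]
    return int(suffix) if suffix.isdigit() else None

def reducer_output(lines, partition):
    """Yield the reducer's output lines; only partition 0 adds a header."""
    if partition == 0:
        # Hypothetical special case for the first partition.
        yield "# header (partition 0 only)\n"
    for line in lines:
        # Placeholder for the real reduce logic: pass each line through.
        yield line
```

In the actual reducer script, this could be wired up with sys.stdout.writelines(reducer_output(sys.stdin, get_partition())) after importing sys.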

Upvotes: 1
