surya

Reputation: 287

Getting the file name in Hadoop mapper using Hadoop Pipes

How can I get the name of the input file that the mapper is currently processing in Hadoop Pipes?

In a Java-based MapReduce job I can easily get the file name like this:

FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
System.out.println("File name " + filename);
System.out.println("Directory and Filename" + fileSplit.getPath().toString());

But how can I get it in C++?

Please help.

Thanks

Upvotes: 2

Views: 2604

Answers (6)

Sridhar Pothamsetti

Reputation: 66

The code below prints the file name:

import os

# Full path of the file this mapper is reading
filepath = os.environ['mapreduce_map_input_file']
# Keep only the last path component
filename = os.path.split(filepath)[-1]
print(filename)

Upvotes: 0

zeekvfu

Reputation: 3403

You can get the map input file name from the mapreduce_map_input_file (new) or map_input_file (deprecated) environment variable.

Note: both environment variable names are case-sensitive and must be written entirely in lower case.
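
Since the question asks about C++, here is a minimal sketch of that lookup. It assumes, as this answer describes, that the variables are present in the task's environment; the function name mapInputFile is hypothetical and not part of any Hadoop API.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the map input file path from the task
// environment, or an empty string if neither variable is set.
// Assumes the framework exports the job configuration as environment
// variables, as described in this thread.
std::string mapInputFile() {
    // Try the newer name first, then the deprecated one (both lower-case).
    const char* names[] = {"mapreduce_map_input_file", "map_input_file"};
    for (const char* name : names) {
        const char* value = std::getenv(name);
        if (value != nullptr && *value != '\0') {
            return value;
        }
    }
    return std::string();
}
```

Checking both names keeps the same binary usable on clusters that export only the older variable.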

Upvotes: 1

Boggio

Reputation: 1148

If you are using Hadoop 2.x with Python:

import os
file_name = os.environ['mapreduce_map_input_file']

Upvotes: 1

Joffrey

Reputation: 301

I struggled with the same problem and found a solution.

void map(HadoopPipes::MapContext& context) {
    // The input split as delivered to the Pipes task
    std::string path = context.getInputSplit();
    // Drop the trailing extra character (guard against an empty string)
    if (!path.empty()) {
        path.erase(path.end() - 1);
    }
}

I have posted only the part that reads the file name. In my case, getInputSplit() returned the full path of the file plus an unexpected character at the end. I wanted the bare path, so I remove the last character of the string. I do not know why the extra character is appended, but removing it gives a usable path.

Upvotes: 0

Suman

Reputation: 9571

Figured out how to do this in Python:

import os
filename = os.environ['map_input_file']

filename is the variable that you want - this will give you the filename that the mapper is working on.

Some other useful environment variables are:

  • mapred_job_id = the full job id
  • mapred_tip_id = the id of that specific mapper or reducer task

Upvotes: 0

Chris White

Reputation: 30089

For streaming / pipes jobs, the job configuration is serialized to process environment variables.

The job configuration property that defines the input file is named map.input.file. The PipeMapRed class, which launches the C++ program, is responsible for this serialization (configure method, line 151). It ensures that the job conf property names are escaped (addJobConfToEnvironment method, line 206/266), meaning that every character other than a-z, A-Z and 0-9 is replaced with an underscore (safeEnvVarName method, lines 276/284). The environment variable you're looking for in your C++ program will therefore be named map_input_file.
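
To make the escaping rule concrete, here is a small C++ sketch of the mapping described above. It is an illustration written for this answer, not the Hadoop source; the function name toEnvVarName is hypothetical.

```cpp
#include <string>

// Hypothetical illustration of the escaping rule described above:
// every character outside a-z, A-Z and 0-9 becomes an underscore.
std::string toEnvVarName(const std::string& property) {
    std::string name = property;
    for (char& c : name) {
        const bool alnum = (c >= 'a' && c <= 'z') ||
                           (c >= 'A' && c <= 'Z') ||
                           (c >= '0' && c <= '9');
        if (!alnum) {
            c = '_';
        }
    }
    return name;
}
```

Under this rule, toEnvVarName("map.input.file") yields map_input_file, which is the name to look up in the environment.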

I'm not a C++ programmer, so I can't tell you how to obtain environment variables, but I'm sure it's simple enough.
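
For completeness, a minimal C++ sketch: the standard library's std::getenv reads an environment variable, and taking the text after the last '/' gives the bare file name, much like getPath().getName() in the Java snippet from the question. The helper names baseName and currentInputFileName are hypothetical, and the sketch assumes map_input_file is set for the task as described above.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the last component of a '/'-separated path.
std::string baseName(const std::string& path) {
    const std::string::size_type slash = path.find_last_of('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}

// Hypothetical helper: file name of the current map input, or an empty
// string if the map_input_file variable is not set.
std::string currentInputFileName() {
    const char* path = std::getenv("map_input_file");
    return path ? baseName(path) : std::string();
}
```

std::getenv returns a null pointer when the variable is missing, so the result should be checked before it is used.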

Upvotes: 3
