nik686

Reputation: 705

Read a text file from HDFS line by line in mapper

Is the following code the right way for a mapper to read a text file from HDFS? And if it is:

  1. What happens if two mappers in different nodes try to open the file at almost the same time?
  2. Isn't there a need to close the InputStreamReader? If so, how to do it without closing the filesystem?

My code is:

Path pt=new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(pt)));
String line;
line=br.readLine();
while (line != null){
System.out.println(line);

Upvotes: 9

Views: 30699

Answers (2)

Amreen zaidi

Reputation: 1

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def load_file(path: String): List[String] = {
  val pt = new Path(path)
  val fs = FileSystem.get(new Configuration())
  val br = new BufferedReader(new InputStreamReader(fs.open(pt)))
  var res: List[String] = List()
  try {
    var line = br.readLine()
    while (line != null) {
      System.out.println(line)
      res = res :+ line
      line = br.readLine()
    }
  } finally {
    // you should close out the BufferedReader
    br.close()
  }
  res
}

Upvotes: 0

Chris White

Reputation: 30089

This will work, with some amendments - I assume the code you've pasted is just truncated:

Path pt=new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br=new BufferedReader(new InputStreamReader(fs.open(pt)));
try {
  String line;
  line=br.readLine();
  while (line != null){
    System.out.println(line);

    // be sure to read the next line otherwise you'll get an infinite loop
    line = br.readLine();
  }
} finally {
  // you should close out the BufferedReader
  br.close();
}

You can have more than one mapper reading the same file, but there is a limit beyond which it makes more sense to use the Distributed Cache: it not only reduces the load on the data nodes that host the file's blocks, but is also more efficient when your job has more tasks than task nodes.
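
For reference, here is a minimal sketch of that Distributed Cache route (Hadoop 2.x mapreduce API; the HDFS path, the #sideData link name, and the mapper class are placeholders, not taken from the original answer):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileExample {

  // Driver side: register the shared file once; the framework copies it to
  // each task node instead of every mapper opening it from HDFS.
  public static void configureJob(Configuration conf) throws Exception {
    Job job = Job.getInstance(conf, "side file example");
    job.addCacheFile(new URI("hdfs://namenode/path/to/sideData.txt#sideData"));
    // ... set mapper class, input/output paths, etc.
  }

  public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException {
      // The cached file is symlinked into the task's working directory under
      // the name after '#', so plain local-file I/O is enough here.
      BufferedReader br = new BufferedReader(new FileReader("sideData"));
      try {
        String line;
        while ((line = br.readLine()) != null) {
          // load the side data, e.g. into an in-memory map
        }
      } finally {
        br.close();
      }
    }
  }
}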

Upvotes: 23
