Reputation: 705
Is the following code for Mappers, reading a text file from HDFS, right? And if it is: isn't it necessary to close the InputStreamReader? If so, how do I do that without closing the filesystem? My code is:
Path pt = new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
String line;
line = br.readLine();
while (line != null) {
    System.out.println(line);
}
Upvotes: 9
Views: 30699
Reputation: 1
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def load_file(path: String): List[String] = {
  val pt = new Path(path)
  val fs = FileSystem.get(new Configuration())
  val br = new BufferedReader(new InputStreamReader(fs.open(pt)))
  var res: List[String] = List()
  try {
    var line = br.readLine()
    while (line != null) {
      System.out.println(line)
      res = res :+ line
      line = br.readLine()
    }
  } finally {
    // you should close out the BufferedReader
    br.close()
  }
  res
}
Upvotes: 0
Reputation: 30089
This will work, with some amendments. I assume the code you've pasted is just truncated:
Path pt = new Path("hdfs://pathTofile");
FileSystem fs = FileSystem.get(context.getConfiguration());
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
try {
    String line;
    line = br.readLine();
    while (line != null) {
        System.out.println(line);
        // be sure to read the next line otherwise you'll get an infinite loop
        line = br.readLine();
    }
} finally {
    // you should close out the BufferedReader
    br.close();
}
You can have more than one mapper reading the same file, but there is a limit beyond which it makes more sense to use the Distributed Cache. That not only reduces the load on the data nodes which host the blocks for the file, but is also more efficient if you have a job with a larger number of tasks than task nodes.
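For illustration, here is a minimal sketch of the Distributed Cache approach using the Hadoop 2.x Job API (the older org.apache.hadoop.filecache.DistributedCache class works similarly); the HDFS path, the "lookup" symlink name, and the class names are all hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    // In the driver: register the file so it is copied once to each task node.
    // The "#lookup" fragment creates a symlink named "lookup" in the
    // task's working directory.
    public static void configureJob(Job job) throws Exception {
        job.addCacheFile(new URI("hdfs://namenode/path/to/file#lookup"));
    }

    public static class MyMapper
            extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            // Read the local copy via the symlink instead of opening the
            // file on HDFS from every task.
            BufferedReader br = new BufferedReader(new FileReader("lookup"));
            try {
                String line = br.readLine();
                while (line != null) {
                    // use the line ...
                    line = br.readLine();
                }
            } finally {
                br.close();
            }
        }
    }
}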
Upvotes: 23