Reputation: 281
I am new to Hadoop and am running some of the examples to become more familiar with it. I ran WordCount, and when I went to check the output with hadoop fs -cat outt, I got three directories instead of the usual single one named outt/part-00000. Here are the directories I have:
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 15 2014-07-11 20:13 outt/part-r-00000
-rw-r--r-- 1 hadoop supergroup 0 2014-07-11 20:13 outt/part-r-00001
When I do hadoop fs -cat outt/_SUCCESS and hadoop fs -cat outt/part-r-00001, nothing appears. However, when I do hadoop fs -cat outt/part-r-00000, I get: record_count 1.
My file just says "Hello World", so I am expecting the result: Hello 1 World 1.
Does anyone know how to get the correct output?
Upvotes: 0
Views: 579
Reputation: 13402
When you run hadoop fs -cat outt/part-r-00000 and get output like record_count 1, it probably means your mapper is counting the number of lines in the input file rather than the words. Once you read a line, you need to tokenize it and emit each word (token) separately.
Here is sample code:
// Inside the Mapper's map(key, value, context) method;
// 'word' is a Text field and 'one' is a constant IntWritable(1) defined on the Mapper class.
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken()); // emit each word with a count of 1
    context.write(word, one);
}
You can find the full code here: WordCount
Here, instead of StringTokenizer, you can use the split() method of the Java String API.
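For reference, here is a minimal mapper sketch using String.split() instead of StringTokenizer (the class name WordCountMapper is illustrative, not from the original post):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for every token.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}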
Upvotes: 1
Reputation: 518
1.) The _SUCCESS and part-r-00000/1 entries are not directories but files. A directory is more like a collection of files and other directories.
2.) The _SUCCESS file is created automatically by Hadoop when the submitted job completes successfully on all the nodes and reducers and the result set is complete.
3.) If you are getting two part files, it implies that you have two reducers in your job. Check the code for a statement like job.setNumReduceTasks(2);. The part named 00000 is the output of the first reducer and 00001 is the output of the second reducer. The 'r' implies that the output came from a reducer; if you see 'm' instead of 'r', it means there is no reducer and the job is map-only.
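If you want all the output in a single part file, you can set the reducer count to one in the driver. A minimal driver sketch under that assumption (WordCountMapper and WordCountReducer are illustrative class names, not the code from your job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // illustrative mapper class
        job.setReducerClass(WordCountReducer.class); // illustrative reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1); // one reducer -> a single part-r-00000 output file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}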
Upvotes: 2