user2590144

Reputation: 281

Hadoop WordCount Output

I am new to Hadoop and am running some of the examples to become more familiar with it. I ran WordCount, and when I went to check the output with hadoop fs -cat outt, I got 3 directories instead of the usual one named outt/part-00000. Here are the directories I have:

-rw-r--r--   1 hadoop supergroup          0 2014-07-11 20:13 outt/_SUCCESS 
-rw-r--r--   1 hadoop supergroup         15 2014-07-11 20:13 outt/part-r-00000
-rw-r--r--   1 hadoop supergroup          0 2014-07-11 20:13 outt/part-r-00001

When I do hadoop fs -cat outt/_SUCCESS and hadoop fs -cat outt/part-r-00001, nothing appears. However, when I do hadoop fs -cat outt/part-r-00000 I get: record_count 1.

My file just says "Hello World" so I am expecting the result: Hello 1 World 1.

Does anyone know how to get the correct output?

Upvotes: 0

Views: 579

Answers (2)

YoungHobbit

Reputation: 13402

When you run hadoop fs -cat outt/part-r-00000 and get the output record_count 1, it probably means your mapper is counting the number of lines in the input file rather than the words in each line.

Once you read a line, you need to tokenize it and take each word (token) out of it.

Here is sample code:

// value is the input line passed to the mapper's map() method
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
  // word is a reusable Text field, one is a constant IntWritable(1)
  word.set(tokenizer.nextToken());
  context.write(word, one);
}

You can find the full code here: WordCount

Here, instead of StringTokenizer, you can use the split method of the Java String API, for example:
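A minimal mapper sketch using split (the class name TokenizerMapper is only illustrative; it assumes the same word and one fields as the snippet above and the newer mapreduce API) could look like this:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper class; emits (word, 1) for every whitespace-separated token.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Split the line on whitespace instead of using StringTokenizer.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, one);   // emit (word, 1) for each token
      }
    }
  }
}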

Upvotes: 1

tacticurv

Reputation: 518

1.) The _SUCCESS and part-r-00000/1 entries are not directories but files. A directory is more like a set of files and other directories.

2.) The _SUCCESS file is automatically created by Hadoop if the submitted job completes successfully on all the nodes and reducers and the result set is complete.

3.) If you are getting two part files, it implies that you have two reducers in your job configuration. Check the code for a statement like job.setNumReduceTasks(2);. The part named 00000 is the output of the first reducer and 00001 is the output of the second. The 'r' indicates that the output came from a reducer; if you see 'm' instead of 'r', it means there is no reducer and the job is a map-only job. A sketch of the relevant driver lines is shown below.
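For reference, a minimal driver sketch showing where setNumReduceTasks would appear (class names such as WordCountDriver, TokenizerMapper and IntSumReducer are placeholders for your own classes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative driver; input and output paths come from the command line.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // your mapper class
    job.setReducerClass(IntSumReducer.class);    // your reducer class
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(1);                    // one reducer -> a single part-r-00000
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}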

Upvotes: 2
