Tom

Reputation: 5

hadoop - total line of input files

I have an input file that contains:

id   value
1e   1
2e   1
...
2e   1
3e   1
4e   1

And I would like to find the total number of ids in my input file. So in my main class I have declared a set, so that when I read the input file I can insert each id into it:

    // MainDriver.java
    public static Set<String> list = new HashSet<String>();

and in my map:

// Apply regex to find the id
...

// Insert id to the list
MainDriver.list.add(regex.group(1));    // add 1e, 2e, 3e ...

and in my reduce, I try to use the list as follows:

 public void reduce(WritableComparable key, Iterator values,
            OutputCollector output, Reporter reporter) throws IOException 
    {
        ...
        output.collect(key, new IntWritable(MainDriver.list.size()));
    }

So I expect the printed value, in this case, to be 4. But it actually prints out 0.

I have verified that regex.group(1) extracts a valid id, so I have no clue why the size of my set is 0 in the reduce phase.

Upvotes: 0

Views: 68

Answers (2)

whitfin

Reputation: 4629

This is basically ignoring the advantage of using MapReduce in the first place.

Correct me if I'm wrong, but it appears you can key the output of your Mapper by the id, and then in your Reducer you receive something like Text key, Iterator values as the parameters.

You can then just sum up the values and call output.collect(key, <total value>);

Example (apologies for using Context rather than OutputCollector, but the logic is the same):

 public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text id = new Text();
    private final Text countOne = new Text("1");

    public void map(LongWritable key, Text value,
                    Context context) throws IOException, InterruptedException {
         id.set(regex.group(1)); // extract the id however you do now
         context.write(id, countOne);
    }

}

public static class MyReducer extends Reducer<Text, Text, Text, IntWritable> {

    private final IntWritable totalCount = new IntWritable();

    public void reduce(Text key, Iterable<Text> values,
                       Context context) throws IOException, InterruptedException {

        int cnt = 0;
        for (Text value : values) {
            cnt++;
        }

        totalCount.set(cnt);
        context.write(key, totalCount);
    }

}

Upvotes: 0

Jeremy Beard

Reputation: 2725

The mappers and reducers run on separate JVMs (and often separate machines altogether) both from each other and from the driver program, so there is no common instance of your list Set variable that all of those methods can concurrently read and write to.

One way in MapReduce to count the number of keys is:

  • Emit (id, 1) from your mapper
  • (optionally) Sum the 1s for each mapper using a combiner to minimize network and reducer I/O
  • In the reducer:
    • In setup() initialize a class-scope numeric variable (int or long presumably) to 0
    • In reduce() increment the counter, and ignore the values
    • In cleanup() emit the counter value now that all keys have been processed
  • Run the job with a single reducer, so all the keys go to the same JVM where a single count can be made
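The setup/reduce/cleanup pattern above can be sketched in plain Java, outside of Hadoop. The class and method signatures here are illustrative stand-ins for the real Reducer hooks (the actual Hadoop methods take a Context parameter), but the counting logic is the same: the counter lives in the reducer instance, is incremented once per distinct key, and is only emitted after all keys have been seen.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for the single reducer described above (not the
// real Hadoop API): setup() zeroes a counter, reduce() is called once per
// distinct key and ignores the values, cleanup() emits the final count.
public class DistinctIdCount {

    private long count;

    public void setup() {
        count = 0;                        // class-scope counter, initialized once
    }

    public void reduce(String key, List<Integer> values) {
        count++;                          // one increment per key; values ignored
    }

    public long cleanup() {
        return count;                     // emitted after all keys are processed
    }

    public static void main(String[] args) {
        DistinctIdCount reducer = new DistinctIdCount();
        reducer.setup();
        // The framework groups by key: one reduce() call per distinct id.
        reducer.reduce("1e", Arrays.asList(1));
        reducer.reduce("2e", Arrays.asList(1, 1));
        reducer.reduce("3e", Arrays.asList(1));
        reducer.reduce("4e", Arrays.asList(1));
        System.out.println(reducer.cleanup()); // prints 4
    }
}
```

With a single reducer (job.setNumReduceTasks(1) in the driver), all ids reach this one instance, so the cleanup count is the total number of distinct ids.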

Upvotes: 1
