OiRc

Reputation: 1622

Apache spark map-reduce explanation

I'm wondering how this little snippet works:

If I have this text:

Ut quis pretium tellus. Fusce quis suscipit ipsum. Morbi viverra elit ut malesuada pellentesque. Fusce eu ex quis urna lobortis finibus. Integer aliquam faucibus neque id cursus. Nulla non massa odio. Fusce pretium felis felis, at malesuada felis blandit nec. Praesent ligula enim, gravida sit amet scelerisque eget, porta non mi. Aenean vitae maximus tortor, ac facilisis orci.

and this snippet of code that counts the occurrences of each word in the text above:

        // Load  input data.
        JavaRDD<String> input = sc.textFile(inputFile);
        // Split up into words.
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Arrays.asList(x.split(" "));
            }
        });
        // Transform into word and count.
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2<>(x, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });

It's simple to understand that this line

JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String x) {
                    return Arrays.asList(x.split(" "));
                }
            });

creates a dataset containing all the words, split on spaces,
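The splitting itself is plain `String.split`; a quick stand-alone Java check (no Spark needed) of what each line contributes to the flatMap result. Note that punctuation stays attached to the words, e.g. `tellus.`:

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    // What the FlatMapFunction above does for a single line of input.
    static List<String> splitLine(String line) {
        return Arrays.asList(line.split(" "));
    }

    public static void main(String[] args) {
        System.out.println(splitLine("Ut quis pretium tellus."));
        // [Ut, quis, pretium, tellus.]
    }
}
```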

and this line pairs each word with the value one, so for example:

JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String x) {
                    return new Tuple2<>(x, 1);

Ut,1
quis,1 // and so on

I'm confused about how reduceByKey works, and how it can count the occurrences of each word.

Thanks in advance.

Upvotes: 0

Views: 781

Answers (2)

eliasah

Reputation: 40380

reduceByKey is quite similar to reduce. They both take a function and use it to combine values.

reduceByKey runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.

Because datasets can have very large numbers of keys, reduceByKey is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
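The per-key combining can be sketched in plain Java without a Spark runtime. This is only a simplified, single-machine stand-in for what reduceByKey does (a HashMap plays the role of the grouped dataset, and Integer::sum is the combining function from the question):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

public class ReduceByKeySketch {
    // Simplified stand-in for reduceByKey: combine all values that
    // share the same key using the given function.
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs,
                                            BinaryOperator<Integer> f) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            // If the key is absent, store the value; otherwise combine with f.
            result.merge(p.getKey(), p.getValue(), f);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Fusce quis suscipit ipsum Fusce quis urna";
        // mapToPair step: every word becomes (word, 1).
        List<Map.Entry<String, Integer>> pairs = Arrays.stream(text.split(" "))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
        // reduceByKey step: sum the 1s per word.
        Map<String, Integer> counts = reduceByKey(pairs, Integer::sum);
        System.out.println(counts.get("Fusce")); // 2
        System.out.println(counts.get("quis"));  // 2
        System.out.println(counts.get("ipsum")); // 1
    }
}
```

In real Spark the reduction runs in parallel, with partial results per partition combined afterwards, which is why the combining function must be associative.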

Reference: Learning Spark - Lightning-Fast Big Data Analysis - Chapter 4 - Working with Key/Value Pairs.

Upvotes: 1

sheh

Reputation: 1023

reduceByKey groups the tuples by key (the first element of each tuple) and performs a reduce within each group.

Like this:

mapToPair:      (Ut, 1), (quis, 1), ..., (quis, 1), ..., (quis, 1), ...

reduceByKey:             \         /            |
                          (quis, 1+1)           |
                                \              /
                                 (quis, 2+1)
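The pairwise combining in the diagram can be sketched as a fold over one key's values, in plain Java (a simplified stand-in, since the real reduce is distributed):

```java
import java.util.List;
import java.util.function.BinaryOperator;

public class PairwiseReduce {
    // Combine one key's values pairwise, as reduceByKey does within a group:
    // (quis, 1+1), then (quis, 2+1), and so on.
    static int reduce(List<Integer> values, BinaryOperator<Integer> f) {
        int acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = f.apply(acc, values.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        // Three occurrences of "quis", each mapped to 1 by mapToPair.
        System.out.println(reduce(List.of(1, 1, 1), Integer::sum)); // 3
    }
}
```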

Upvotes: 2
