OiRc

Reputation: 1622

Apache spark map-reduce explanation

I'm wondering how this little snippet works:

If I have this text:

Ut quis pretium tellus. Fusce quis suscipit ipsum. Morbi viverra elit ut malesuada pellentesque. Fusce eu ex quis urna lobortis finibus. Integer aliquam faucibus neque id cursus. Nulla non massa odio. Fusce pretium felis felis, at malesuada felis blandit nec. Praesent ligula enim, gravida sit amet scelerisque eget, porta non mi. Aenean vitae maximus tortor, ac facilisis orci.

and this snippet of code that counts the occurrences of each word in the text above:

        // Load  input data.
        JavaRDD<String> input = sc.textFile(inputFile);
        // Split up into words.
        JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String x) {
                return Arrays.asList(x.split(" "));
            }
        });
        // Transform into word and count.
        JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String x) {
                return new Tuple2<>(x, 1);
            }
        }).reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer x, Integer y) {
                return x + y;
            }
        });

It's simple to understand that this line

JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
                public Iterable<String> call(String x) {
                    return Arrays.asList(x.split(" "));
                }
            });

creates a dataset containing all the words, split on spaces,
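The splitting itself is plain `String.split`; a quick stand-alone Java check (no Spark needed) of what each line contributes to the flatMap result. Note that punctuation stays attached to the words, e.g. `tellus.`:

```java
import java.util.Arrays;
import java.util.List;

public class SplitDemo {
    // What the FlatMapFunction above does for a single line of input.
    static List<String> splitLine(String line) {
        return Arrays.asList(line.split(" "));
    }

    public static void main(String[] args) {
        System.out.println(splitLine("Ut quis pretium tellus."));
        // [Ut, quis, pretium, tellus.]
    }
}
```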

and this line pairs each word with the value one, so for example:

JavaPairRDD<String, Integer> counts = words.mapToPair(new PairFunction<String, String, Integer>() {
                public Tuple2<String, Integer> call(String x) {
                    return new Tuple2<>(x, 1);

Ut,1
quis,1 // and so on

I'm confused about how reduceByKey works, and how it can count the occurrences of each word.

Thanks in advance.

Upvotes: 0

Views: 781

Answers (2)

eliasah

Reputation: 40380

reduceByKey is quite similar to reduce. They both take a function and use it to combine values.

reduceByKey runs several parallel reduce operations, one for each key in the dataset, where each operation combines values that have the same key.

Because datasets can have very large numbers of keys, reduceByKey is not implemented as an action that returns a value to the user program. Instead, it returns a new RDD consisting of each key and the reduced value for that key.
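The per-key combining can be sketched in plain Java without a Spark runtime. This is only a simplified, single-machine stand-in for what reduceByKey does (a HashMap plays the role of the grouped dataset, and Integer::sum is the combining function from the question):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.stream.Collectors;

public class ReduceByKeySketch {
    // Simplified stand-in for reduceByKey: combine all values that
    // share the same key using the given function.
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs,
                                            BinaryOperator<Integer> f) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            // If the key is absent, store the value; otherwise combine with f.
            result.merge(p.getKey(), p.getValue(), f);
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Fusce quis suscipit ipsum Fusce quis urna";
        // mapToPair step: every word becomes (word, 1).
        List<Map.Entry<String, Integer>> pairs = Arrays.stream(text.split(" "))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
        // reduceByKey step: sum the 1s per word.
        Map<String, Integer> counts = reduceByKey(pairs, Integer::sum);
        System.out.println(counts.get("Fusce")); // 2
        System.out.println(counts.get("quis"));  // 2
        System.out.println(counts.get("ipsum")); // 1
    }
}
```

In real Spark the reduction runs in parallel, with partial results per partition combined afterwards, which is why the combining function must be associative.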

Reference: Learning Spark - Lightning-Fast Big Data Analysis - Chapter 4 - Working with Key/Value Pairs.

Upvotes: 1

sheh

Reputation: 1023

reduceByKey groups the tuples by key (the first element of each tuple) and performs a reduce within each group.

Like this:

mapToPair:      (Ut, 1), (quis, 1), ..., (quis, 1), ..., (quis, 1), ...

reduceByKey:             \         /            |
                          (quis, 1+1)           |
                                \              /
                                 (quis, 2+1)
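The pairwise combining in the diagram can be sketched as a fold over one key's values, in plain Java (a simplified stand-in, since the real reduce is distributed):

```java
import java.util.List;
import java.util.function.BinaryOperator;

public class PairwiseReduce {
    // Combine one key's values pairwise, as reduceByKey does within a group:
    // (quis, 1+1), then (quis, 2+1), and so on.
    static int reduce(List<Integer> values, BinaryOperator<Integer> f) {
        int acc = values.get(0);
        for (int i = 1; i < values.size(); i++) {
            acc = f.apply(acc, values.get(i));
        }
        return acc;
    }

    public static void main(String[] args) {
        // Three occurrences of "quis", each mapped to 1 by mapToPair.
        System.out.println(reduce(List.of(1, 1, 1), Integer::sum)); // 3
    }
}
```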

Upvotes: 2
