Stephen Kuo

Reputation: 1295

Reading and learning the Spark API?

I am learning Spark by example, but I don't know a good way to understand the API. For instance, take the very classic word count example:

val input = sc.textFile("README.md")                              // RDD[String], one element per line
val words = input.flatMap(x => x.split(" "))                      // RDD[String], one element per word
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)  // RDD[(String, Int)] of word counts
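For reference, assuming a hypothetical two-line README.md containing "a b a" and "b", my understanding of the intermediate results is:

input:                   "a b a", "b"
words:                   "a", "b", "a", "b"
words.map(x => (x, 1)):  ("a",1), ("b",1), ("a",1), ("b",1)
result:                  ("a",2), ("b",2)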

When I read the reduceByKey API, I see:

def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]

The API states: Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.

In the programming guide: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

OK, from the example I know that (x, y) is the (V, V), and that should be the value part of the pair. I supply a function to compute a V and I get back an RDD[(K, V)]. My questions are: in this example, why does reduceByKey(func: (V, V) ⇒ V) take two Vs? Are the 1st and 2nd V in (V, V) the same or not?

I guess I am asking this question because I still don't know how to read the API correctly, or maybe I am missing some basic Spark concept.

Upvotes: 0

Views: 137

Answers (2)

Fabio Fantoni

Reputation: 3167

In the code below:

reduceByKey((x, y) => x + y)

you could, for more clarity, read it as something like this:

reduceByKey((sum, addend) => sum + addend)

So, for every key, that function is applied repeatedly, once for every element with that key.

Basically, (func: (V, V) ⇒ V) means that you have a function with two inputs of a certain type (let's say Int) which returns a single output of the same type.
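For instance, here is a minimal sketch in plain Scala (no Spark needed, and the values are just hypothetical) showing that both arguments have the same type V: one is the result merged so far, the other is the next value to merge:

// hypothetical values sharing a single key
val valuesForOneKey = Seq(2, 20, 2)

// reduce applies the function pairwise: (2, 20) => 22, then (22, 2) => 24.
// Both arguments are Ints (the V type); one is a partial result, the other the next value.
val total = valuesForOneKey.reduce((sum, addend) => sum + addend)   // 24

reduceByKey does the same kind of pairwise merging, just per key and distributed across partitions, which is why the function must be associative.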

Upvotes: 0

Knight71

Reputation: 2959

Usually the data sets will be of the form ("key1", val11), ("key2", val21), ("key1", val12), ("key2", val22), and so on.

That is, the same key appears with multiple values in the RDD[(K, V)].

When you use reduceByKey, the function is applied across all the values for each key.

For example, consider the following program:

 val data = Array(("key1", 2), ("key1", 20), ("key2", 21), ("key1", 2), ("key2", 10), ("key2", 33))

 val rdd = sc.parallelize(data)               // RDD[(String, Int)]
 val res = rdd.reduceByKey((x, y) => x + y)   // sum the values for each key
 res.foreach(println)

You will get the output as

 (key2,64)
 (key1,24)

Here the sequence of values for each key is passed to the function pairwise. For key1 that sequence is (2, 20, 2).

In the end, you will have a single value for each key.

You could use the Spark shell to try out the APIs.
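For instance, here is a rough sketch of a spark-shell session (assuming local mode); it also uses the optional second argument mentioned in the programming guide, which sets the number of reduce tasks:

 // `sc` is already defined in spark-shell
 val data = Array(("key1", 2), ("key1", 20), ("key2", 21),
                  ("key1", 2), ("key2", 10), ("key2", 33))
 val rdd = sc.parallelize(data)

 // same sum per key, but the result RDD will have 4 partitions (4 reduce tasks)
 val res = rdd.reduceByKey((x, y) => x + y, 4)

 // collect() brings the results to the driver, so println happens in your shell
 res.collect().foreach(println)   // (key2,64) and (key1,24), in some order

Calling collect() before foreach(println) is just a convenience here; on a real cluster a plain res.foreach(println) would print on the executors rather than in your shell.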

Upvotes: 0
