Stephen Kuo

Reputation: 1295

Reading and learning the Spark API?

I am learning Spark by example, but I don't know a good way to understand the API. For instance, take the very classic word count example:

val input = sc.textFile("README.md")                              // RDD[String], one element per line
val words = input.flatMap(x => x.split(" "))                      // RDD[String], one element per word
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)  // RDD[(String, Int)] of word counts
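For reference, assuming a hypothetical two-line README.md containing "a b a" and "b", my understanding of the intermediate results is:

input:                   "a b a", "b"
words:                   "a", "b", "a", "b"
words.map(x => (x, 1)):  ("a",1), ("b",1), ("a",1), ("b",1)
result:                  ("a",2), ("b",2)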

When I read the reduceByKey API, I see:

def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]

The API states: Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.

In the programming guide: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.

OK, from the example I know that (x, y) is the (V, V), and that should be the value part of the pair. I supply a function to compute a V and I get back an RDD[(K, V)]. My questions are: in this example, why does reduceByKey(func: (V, V) ⇒ V) take two Vs? Are the 1st and 2nd V in (V, V) the same or not?

I guess I am asking this question because I still don't know how to read the API correctly, or maybe I am missing some basic Spark concept.

Upvotes: 0

Views: 137

Answers (2)

Fabio Fantoni

Reputation: 3167

In the code below:

reduceByKey((x, y) => x + y)

you could, for more clarity, read it as something like this:

reduceByKey((sum, addend) => sum + addend)

So, for every key, that function is applied repeatedly, once for every element with that key.

Basically, (func: (V, V) ⇒ V) means that you have a function with two inputs of a certain type (let's say Int) which returns a single output of the same type.
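For instance, here is a minimal sketch in plain Scala (no Spark needed, and the values are just hypothetical) showing that both arguments have the same type V: one is the result merged so far, the other is the next value to merge:

// hypothetical values sharing a single key
val valuesForOneKey = Seq(2, 20, 2)

// reduce applies the function pairwise: (2, 20) => 22, then (22, 2) => 24.
// Both arguments are Ints (the V type); one is a partial result, the other the next value.
val total = valuesForOneKey.reduce((sum, addend) => sum + addend)   // 24

reduceByKey does the same kind of pairwise merging, just per key and distributed across partitions, which is why the function must be associative.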

Upvotes: 0

Knight71

Reputation: 2959

Usually the data sets will be of the form ("key1", val11), ("key2", val21), ("key1", val12), ("key2", val22), and so on.

That is, the same key appears with multiple values in the RDD[(K, V)].

When you use reduceByKey, the function is applied across all the values for each key.

For example, consider the following program:

 val data = Array(("key1", 2), ("key1", 20), ("key2", 21), ("key1", 2), ("key2", 10), ("key2", 33))

 val rdd = sc.parallelize(data)               // RDD[(String, Int)]
 val res = rdd.reduceByKey((x, y) => x + y)   // sum the values for each key
 res.foreach(println)

You will get the output as

 (key2,64)
 (key1,24)

Here the sequence of values for each key is passed to the function pairwise. For key1 that sequence is (2, 20, 2).

In the end, you will have a single value for each key.

You could use the Spark shell to try out the APIs.
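For instance, here is a rough sketch of a spark-shell session (assuming local mode); it also uses the optional second argument mentioned in the programming guide, which sets the number of reduce tasks:

 // `sc` is already defined in spark-shell
 val data = Array(("key1", 2), ("key1", 20), ("key2", 21),
                  ("key1", 2), ("key2", 10), ("key2", 33))
 val rdd = sc.parallelize(data)

 // same sum per key, but the result RDD will have 4 partitions (4 reduce tasks)
 val res = rdd.reduceByKey((x, y) => x + y, 4)

 // collect() brings the results to the driver, so println happens in your shell
 res.collect().foreach(println)   // (key2,64) and (key1,24), in some order

Calling collect() before foreach(println) is just a convenience here; on a real cluster a plain res.foreach(println) would print on the executors rather than in your shell.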

Upvotes: 0
