Reputation: 1295
I am learning Spark by example, but I don't know a good way to understand the API. For instance, take the classic word count example:
val input = sc.textFile("README.md")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
When I read the reduceByKey API, I see:
def reduceByKey(func: (V, V) ⇒ V): RDD[(K, V)]
The API states: Merge the values for each key using an associative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/parallelism level.
In the programming guide: When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
OK, from the example I know that (x, y) is (V, V), which should be the value part of the pairs. I give a function to compute the V and I get RDD[(K, V)]. My questions are: in this example, in reduceByKey(func: (V, V) ⇒ V), why are there two Vs? Are the 1st and 2nd V in (V, V) the same or not?
I guess I am asking this question because I still don't know how to correctly read the API, or I am missing some basic Spark concept?!
Upvotes: 0
Views: 137
Reputation: 3167
In the code below:
reduceByKey((x, y) => x + y)
for more clarity, you could read it as something like this:
reduceByKey((sum, addend) => sum + addend)
So, for every key, that function is applied repeatedly across all the elements that share that key.
Basically, (func: (V, V) ⇒ V) means that you have a function with two inputs of a certain type (say, Int) which returns a single output of the same type.
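As a minimal sketch of why both Vs must be the same type (plain Scala, no Spark needed; the values here are made up for illustration), reduce on an ordinary list applies the same kind of (V, V) => V function:
// reduce applies the (V, V) => V function pairwise: here the first V
// is the result accumulated so far and the second V is the next
// element, i.e. ((2 + 20) + 2).
val valuesForOneKey = List(2, 20, 2)
val total = valuesForOneKey.reduce((sum, addend) => sum + addend)
println(total)   // 24
In Spark the function must also be associative, because partial results computed on different partitions are merged in no particular order.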
Upvotes: 0
Reputation: 2959
Usually the data sets will be of the form ("key1",val11),("key2",val21),("key1",val12),("key2",val22), and so on.
The same key can occur with multiple values in the RDD[(K,V)]. When you use reduceByKey, the function is applied to the values of each key.
For example, consider the following program:
val data = Array(("key1",2),("key1",20),("key2",21),("key1",2),("key2",10),("key2",33))
val rdd = sc.parallelize(data)                 // RDD[(String, Int)]
val res = rdd.reduceByKey((x, y) => x + y)     // sum the values of each key
res.foreach(println)
You will get the output:
(key2,64)
(key1,24)
Here the values of each key are passed to the function pairwise. For key1, the values (2, 20, 2) are combined as (2 + 20) + 2 = 24.
In the end, you will have a single value for each key.
You could use the Spark shell to try out the APIs.
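For instance (just a sketch using the same rdd as above), you can get the same totals with groupByKey followed by a manual reduce, which makes it visible that both Vs are ordinary values of the same type; reduceByKey is usually preferred because it also merges values locally before shuffling:
// Same totals via groupByKey: collect all the values of each key,
// then reduce them ourselves with the same (V, V) => V function.
val res2 = rdd.groupByKey().mapValues(vs => vs.reduce((x, y) => x + y))
res2.foreach(println)   // (key2,64) and (key1,24)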
Upvotes: 0