Sudha

Reputation: 339

groupByKey in Spark dataset

Please help me understand the parameter we pass to groupByKey when it is used on a Dataset.

scala> val data = spark.read.text("Sample.txt").as[String]
data: org.apache.spark.sql.Dataset[String] = [value: string]

scala> data.flatMap(_.split(" ")).groupByKey(l=>l).count.show

In the above code, please help me understand what (l=>l) means in groupByKey(l=>l).

Upvotes: 2

Views: 32778

Answers (1)

Traian

Reputation: 1464

l => l says that the whole string (in your case every word, since you're tokenizing on spaces) will be used as the key. This way you get all occurrences of each word in the same partition and you can count them. As you have probably seen in other articles, it is preferable to use reduceByKey in this case, so you don't need to collect all values for each key in memory before counting.
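As a minimal sketch of that alternative (assuming the same data Dataset from the question): reduceByKey lives on the RDD API, so you would drop down to the underlying RDD first. The counts are then combined within each partition before the shuffle:

scala> data.rdd.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).collect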

You need a function that derives your key from the dataset's data.

In your example, your function takes the whole string as is and uses it as the key. A different example would be, for a Dataset[String], to use the first 3 characters of your string as the key instead of the whole string:

scala> val ds = List("abcdef", "abcd", "cdef", "mnop").toDS
ds: org.apache.spark.sql.Dataset[String] = [value: string]

scala> ds.show
+------+
| value|
+------+
|abcdef|
|  abcd|
|  cdef|
|  mnop|
+------+

scala> ds.groupByKey(l => l.substring(0,3)).keys.show
+-----+
|value|
+-----+
|  cde|
|  mno|
|  abc|
+-----+

The group for key "abc" will have 2 values.
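If you want to inspect the values in each group, a quick sketch (illustrative only) using mapGroups materializes the values per key:

scala> ds.groupByKey(l => l.substring(0, 3)).mapGroups((key, values) => (key, values.toList)).show

Here the row for key "abc" would contain both "abcdef" and "abcd".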

Here is the difference between how the key gets transformed with substring versus with (l => l), so you can see it better:

scala> ds.groupByKey(l => l.substring(0,3)).count.show
+-----+--------+
|value|count(1)|
+-----+--------+
|  cde|       1|
|  mno|       1|
|  abc|       2|
+-----+--------+


scala> ds.groupByKey(l => l).count.show
+------+--------+
| value|count(1)|
+------+--------+
|  abcd|       1|
|  cdef|       1|
|abcdef|       1|
|  mnop|       1|
+------+--------+

Upvotes: 15
