Reputation: 79
I have a problem with the combinations method.
My code:
scala> val myRDD = sc.parallelize(Seq("aaa bbb bbb"))
myRDD: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> myRDD.foreach{println}
aaa bbb bbb
scala> myRDD.map(_.split(" ")).flatMap(_.combinations(2)).
| map(p=>(p.mkString(","),1)).
| reduceByKey(_+_).
| foreach{println}
(aaa,bbb,1)
(bbb,bbb,1)
I don't understand why the output is not:
(aaa,bbb,2)
(bbb,aaa,2)
(bbb,bbb,1)
Upvotes: 1
Views: 1426
Reputation: 67075
The Scala documentation covers this pretty well, I think:
Iterates over combinations. A combination of length n is a subsequence of the original sequence, with the elements taken in order. Thus, "xy" and "yy" are both length-2 combinations of "xyy", but "yx" is not. If there is more than one way to generate the same subsequence, only one will be returned.
For example, "xyyy" has three different ways to generate "xy" depending on whether the first, second, or third "y" is selected. However, since all are identical, only one will be chosen. Which of the three will be taken is an implementation detail that is not defined.
In your specific case this breaks down to something like:
(aaa, bbb)
(aaa, bbb) //Thrown out since it duplicates the first
(bbb, bbb)
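To see the de-duplication outside Spark, here is a minimal sketch on a plain Scala collection. The names words, distinctPairs, positionalPairs and counts are only illustrative, and combining indices instead of values is one possible workaround, not something the documentation prescribes. It assumes you want every positional pair counted; it runs as-is in the REPL or as a script.

```scala
// Same tokens as in the question, as a local collection instead of an RDD.
val words = "aaa bbb bbb".split(" ").toSeq

// combinations(2) on the values collapses identical subsequences,
// so "aaa,bbb" is produced only once.
val distinctPairs = words.combinations(2).map(_.mkString(",")).toList

// Combining the indices 0, 1, 2 instead keeps every positional pair:
// (0,1), (0,2) and (1,2). Each index is then mapped back to its word.
val positionalPairs = words.indices
  .combinations(2)
  .map(idx => idx.map(i => words(i)).mkString(","))
  .toList

// Count how often each pair occurs, similar in spirit to reduceByKey(_ + _).
val counts = positionalPairs.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }

println(distinctPairs)   // List(aaa,bbb, bbb,bbb)
println(positionalPairs) // List(aaa,bbb, aaa,bbb, bbb,bbb)
```

With the index-based version, "aaa,bbb" is counted twice and "bbb,bbb" once. The same index trick should carry over to the function passed to flatMap on the RDD, although that is not tested here.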
Upvotes: 2
Reputation: 3770
For the combinations function, a combination of length n is a subsequence of the original sequence, with the elements taken in order. So in your case, for (aaa,bbb,bbb) the possible subsequences are (aaa,bbb) and (bbb,bbb), but not (bbb,aaa), because aaa never comes after bbb in the input.
Please refer to the Scala documentation.
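If the reversed pair (bbb,aaa) is also wanted, one way to get it is to emit each positional pair in both orders yourself. This is only a sketch of one reading of the question: tokens, orderedPairs and pairCounts are hypothetical names, and the rule "emit the reversed pair only when the two words differ" is an assumption chosen because it reproduces the counts the question expected.

```scala
// Same tokens as in the question, held in a Vector for fast indexed access.
val tokens = "aaa bbb bbb".split(" ").toVector

// Walk over index pairs (i < j) so duplicates in the values are not collapsed,
// then emit the pair in both orders unless both words are identical.
val orderedPairs = tokens.indices.combinations(2).flatMap { idx =>
  val a = tokens(idx(0))
  val b = tokens(idx(1))
  if (a == b) Seq((a, b)) else Seq((a, b), (b, a))
}.toList

// Count occurrences of each ordered pair.
val pairCounts = orderedPairs.groupBy(identity).map { case (pair, hits) => (pair, hits.size) }

pairCounts.foreach(println)
```

Under that assumption the counts are 2 for (aaa,bbb), 2 for (bbb,aaa) and 1 for (bbb,bbb).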
Upvotes: 1