Reputation: 135
I am facing an issue in creating a histogram in Scala. I have used histogram
on an RDD.
For example:
val eg = sc.parallelize(Seq(1,1,1,1,1,1,1,1,1,1))
eg.histogram(5)
gives the output as:
(Array[Double], Array[Long]) = (Array(1.0, 1.0),Array(10))
I expect the output to be like:
(Array[Double], Array[Long]) = (Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0),Array(10, 0, 0, 0, 0))
, but the function does not return the correct splits when the sequence is of same value.
Upvotes: 3
Views: 2457
Reputation: 10406
As mentioned in the scaladoc of the RDD API, if the elements of the RDD do not vary (as in your case) there is going to be only one bucket, which is what you experience.
def histogram(bucketCount: Int): (Array[Double], Array[Long])
Compute a histogram of the data using bucketCount number of buckets evenly spaced between the minimum and maximum of the RDD. [...] If the elements in RDD do not vary (max == min) always returns a single bucket.
It works as you expect if I add a 2 in your sequence (so that min=1
and max=2
)
sc.parallelize((0 until 10).map(_ => 1) :+ 2).histogram(5)
res75: (Array[Double], Array[Long]) = (Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0),Array(10, 0, 0, 0, 1))
You could also use this signature of the histogram
method if you want to define the buckets yourself:
def histogram(buckets: Array[Double], evenBuckets: Boolean = false): Array[Long]
Upvotes: 3
Reputation: 22625
Instead of passing a number of buckets, you can explicitly pass buckets(splits) as an array:
eg.histogram(Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0))
The only difference is that you will just receive an array of longs back instead of tuple. If you want to get the same result as before, you'd need to create tuple yourself:
val buckets = Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0)
val histogram = eg.histogram(buckets)
val result = (buckets, histogram) //(Array[Double], Array[Long])
Upvotes: 1