xin
xin

Reputation: 135

Histogram for RDD in Scala?

I am facing an issue in creating a histogram in Scala. I have used histogram on an RDD.

For example: val eg = sc.parallelize(Seq(1,1,1,1,1,1,1,1,1,1)) eg.histogram(5) gives the output as: (Array[Double], Array[Long]) = (Array(1.0, 1.0),Array(10))

I expect the output to be like: (Array[Double], Array[Long]) = (Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0),Array(10, 0, 0, 0, 0)), but the function does not return the correct splits when the sequence is of same value.

Upvotes: 3

Views: 2457

Answers (2)

Oli
Oli

Reputation: 10406

As mentioned in the scaladoc of the RDD API, if the elements of the RDD do not vary (as in your case) there is going to be only one bucket, which is what you experience.

def histogram(bucketCount: Int): (Array[Double], Array[Long])

Compute a histogram of the data using bucketCount number of buckets evenly spaced between the minimum and maximum of the RDD. [...] If the elements in RDD do not vary (max == min) always returns a single bucket.

It works as you expect if I add a 2 in your sequence (so that min=1 and max=2)

sc.parallelize((0 until 10).map(_ => 1) :+ 2).histogram(5)
res75: (Array[Double], Array[Long]) = (Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0),Array(10, 0, 0, 0, 1))

You could also use this signature of the histogram method if you want to define the buckets yourself:

def histogram(buckets: Array[Double], evenBuckets: Boolean = false): Array[Long]

Upvotes: 3

Krzysztof Atłasik
Krzysztof Atłasik

Reputation: 22625

Instead of passing a number of buckets, you can explicitly pass buckets(splits) as an array:

eg.histogram(Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0))

The only difference is that you will just receive an array of longs back instead of tuple. If you want to get the same result as before, you'd need to create tuple yourself:

val buckets = Array(1.0, 1.2, 1.4, 1.6, 1.8, 2.0)
val histogram = eg.histogram(buckets)
val result = (buckets, histogram) //(Array[Double], Array[Long]) 

Upvotes: 1

Related Questions