Scala RDD count by range

Question

I need to "extract" some data contained in an Iterable[MyObject] (it was a RDD[MyObject] before a groupBy).

My initial RDD[MyObject] :

|-----------|---------|----------|
| startCity | endCity | Customer |
|-----------|---------|----------|
| Paris     | London  | ID | Age |
|           |         |----|-----|
|           |         |  1 | 1   |
|           |         |----|-----|
|           |         |  2 | 1   |
|           |         |----|-----|
|           |         |  3 | 50  |
|-----------|---------|----------|
| Paris     | London  | ID | Age |
|           |         |----|-----|
|           |         |  5 | 40  |
|           |         |----|-----|
|           |         |  6 | 41  |
|           |         |----|-----|
|           |         |  7 | 2   |
|-----------|---------|----|-----|
| New-York  | Paris   | ID | Age |
|           |         |----|-----|
|           |         |  9 | 15  |
|           |         |----|-----|
|           |         |  10| 16  |
|           |         |----|-----|
|           |         |  11| 46  |
|-----------|---------|----|-----|
| New-York  | Paris   | ID | Age |
|           |         |----|-----|
|           |         |  13| 7   |
|           |         |----|-----|
|           |         |  14| 9   |
|           |         |----|-----|
|           |         |  15| 60  |
|-----------|---------|----|-----|
| Barcelona | London  | ID | Age |
|           |         |----|-----|
|           |         |  17| 66  |
|           |         |----|-----|
|           |         |  18| 53  |
|           |         |----|-----|
|           |         |  19| 11  |
|-----------|---------|----|-----|

I need to count them by age range by and groupBy startCity - endCity

The final result should be :

|-----------|---------|-------------|
| startCity | endCity | Customer    |
|-----------|---------|-------------|
| Paris     | London  | Range| Count|
|           |         |------|------|
|           |         |0-2   | 3    |
|           |         |------|------|
|           |         |3-18  | 0    |
|           |         |------|------|
|           |         |19-99 | 3    |
|-----------|---------|-------------|
| New-York  | Paris   | Range| Count|
|           |         |------|------|
|           |         |0-2   | 0    |
|           |         |------|------|
|           |         |3-18  | 3    |
|           |         |------|------|
|           |         |19-99 | 2    |
|-----------|---------|-------------|
| Barcelona | London  | Range| Count|
|           |         |------|------|
|           |         |0-2   | 0    |
|           |         |------|------|
|           |         |3-18  | 1    |
|           |         |------|------|
|           |         |19-99 | 2    |
|-----------|---------|-------------|

At the moment I'm doing this by count 3 times the same data (first time with 0-2 range, then 10-20, then 21-99).

Like :

Iterable[MyObject] ite

ite.count(x => x.age match {
    case Some(age) => { age >= 0 && age < 2 }
}

It's working by giving me an Integer but not efficient at all I think since I have to count many times, what's the best way to do this please ?

Thanks

EDIT : The Customer object is a case class

Oli · Accepted Answer

def computeRange(age : Int) = 
    if(age<=2)
        "0-2"
    else if(age<=10)
        "2-10"
    // etc, you get the idea

Then, with an RDD of case class MyObject(id : String, age : Int)

rdd
   .map(x=> computeRange(x.age) -> 1)
   .reduceByKey(_+_)

Edit: If you need to group by some columns, you can do it this way, provided that you have a RDD[(SomeColumns, Iterable[MyObject])]. The following lines would give you a map that associates each "range" to its number of occurences.

def computeMapOfOccurances(list : Iterable[MyObject]) : Map[String, Int] =
    list
        .map(_.age)
        .map(computeRange)
        .groupBy(x=>x)
        .mapValues(_.size)

val result1 = rdd
    .mapValues( computeMapOfOccurances(_))

And if you need to flatten your data, you can write:

val result2 = result1
    .flatMapValues(_.toSeq)

Scala RDD count by range

Answers (2)

Related Questions