Lazar Gugleta

Reputation: 115

How to properly iterate over Array[String]?

I have a function in Scala that I pass arguments to. I use it like this:

val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, f(v))}

My function f is:

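import java.time.LocalDate
import java.time.format.DateTimeFormatter

val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
implicit val localDateOrdering: Ordering[LocalDate] = Ordering.by(_.toEpochDay)
// Number of days between the earliest and latest date (dates share a year)
def f(v: Array[String]): Int = {
  val parsedDates = v.map(LocalDate.parse(_, formatter))
  parsedDates.max.getDayOfYear - parsedDates.min.getDayOfYear
}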

And this is the error I get:

 found   : Iterable[Array[String]]
 required: Array[String]

I already tried using:

val evega = concat.map(_.split(",")).keyBy(_(0)).groupByKey().map{case (k, v) => (k, for (date <- v) f(date))}

But that produces a large number of errors.

Just to get a better picture, data in concat is:

1974,1974-06-22
1966,1966-07-20
1954,1954-06-19
1994,1994-06-27
1954,1954-06-26
2006,2006-07-04
2010,2010-07-07
1990,1990-06-30
...

It is of type RDD[String]. How can I properly iterate over it and get a single Int from the function f?

Upvotes: 2

Views: 245

Answers (1)

Xavier Guihot

Reputation: 61666

The RDD types along your pipeline are:

  • concat.map(_.split(",")) gives an RDD[Array[String]]
    • for instance Array("1954", "1954-06-19")
  • concat.map(_.split(",")).keyBy(_(0)) gives RDD[(String, Array[String])]
    • for instance ("1954", Array("1954", "1954-06-19"))
  • concat.map(_.split(",")).keyBy(_(0)).groupByKey() gives RDD[(String, Iterable[Array[String]])]
    • for instance ("1954", Iterable(Array("1954", "1954-06-19"), Array("1954", "1954-06-24")))

Thus, when you map at the end, the type of v is Iterable[Array[String]], whereas f expects an Array[String].
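
To see these types without a Spark cluster, here is a minimal sketch using plain Scala collections. It is only a stand-in: groupBy on a local Seq plays the role of keyBy(_(0)).groupByKey(), and the lines sample and the PipelineTypes object name are made up for illustration.

```scala
object PipelineTypes {
  // Hypothetical sample mirroring the question's data (a local Seq, not an RDD).
  val lines: Seq[String] =
    Seq("1954,1954-06-19", "1954,1954-06-26", "1974,1974-06-22")

  // Step 1: each line becomes Array(year, date).
  val split: Seq[Array[String]] = lines.map(_.split(","))

  // Step 2: key by the first element, keeping the whole array as the value
  // (this is what keyBy(_(0)) does on an RDD).
  val keyed: Seq[(String, Array[String])] = split.map(x => x(0) -> x)

  // Step 3: local stand-in for groupByKey: every key now maps to a
  // collection of arrays, not to a single array.
  val grouped: Map[String, Seq[Array[String]]] =
    keyed.groupBy(_._1).map { case (k, pairs) => k -> pairs.map(_._2) }
}
```

The explicit type annotations are the point here: the value side of grouped is a collection of arrays, which is why passing it straight to f does not type-check.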

Since each input line looks like "1974,1974-06-22", one solution is to replace your keyBy transformation with a map:

concat.map(_.split(",")).map(x => x(0) -> x(1)).groupByKey().map{case (k, v) => (k, f(v.toArray))}

Indeed, .map(x => x(0) -> x(1)) (instead of .map(x => x(0) -> x), which keyBy(_(0)) is shorthand for) keeps only the second element of the split array as the value, rather than the array itself. This gives an RDD[(String, String)] at this second step rather than an RDD[(String, Array[String])]. After groupByKey, the values are of type Iterable[String], so v.toArray converts them to the Array[String] that f expects.
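
Putting it together, a minimal sketch of the corrected logic might look like the following. It assumes a variant of f that accepts an Iterable[String] directly (called spanInDays here, a hypothetical name), and it uses a local Seq with groupBy as a stand-in for the RDD and groupByKey so that it runs without Spark.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object DaySpan {
  val formatter: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
  implicit val localDateOrdering: Ordering[LocalDate] =
    Ordering.by((d: LocalDate) => d.toEpochDay)

  // Hypothetical variant of f: takes the grouped values as an Iterable,
  // so no conversion to Array is needed at the call site.
  def spanInDays(dates: Iterable[String]): Int = {
    val parsed = dates.map(LocalDate.parse(_, formatter))
    parsed.max.getDayOfYear - parsed.min.getDayOfYear
  }

  // Hypothetical sample in the same "year,date" format as the question.
  val lines: Seq[String] =
    Seq("1954,1954-06-19", "1954,1954-06-26", "1974,1974-06-22")

  // With Spark, the equivalent would be roughly:
  //   concat.map(_.split(",")).map(x => x(0) -> x(1)).groupByKey().mapValues(spanInDays)
  val result: Map[String, Int] =
    lines
      .map(_.split(","))
      .map(x => x(0) -> x(1))          // (year, date) pairs
      .groupBy(_._1)                   // local stand-in for groupByKey
      .map { case (year, pairs) => year -> spanInDays(pairs.map(_._2)) }
}
```

Note that the day-of-year difference is only meaningful because every group shares the same year, which holds here since the year is the key.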

Upvotes: 2
