user1050325

Reputation: 1272

Iterative lookup within map

def description(list:Array[String]): Array[String] = {
  for (y <- list) yield modulelookup.lookup(y.take(4)) + " " + brandlookup.lookup(y.drop(4)).toString()
}

val printRDD = outputRDD.collect().map(x=> (description(x._1),x._2))

This is my current code. I would like to do this without collect. modulelookup and brandlookup are RDDs. How can I do this?

Upvotes: 0

Views: 89

Answers (1)

zero323

Reputation: 330303

If modulelookup and brandlookup are relatively small, you can collect them as maps, convert those to broadcast variables, and use them for the mapping as follows:

val modulelookupBD = sc.broadcast(modulelookup.collectAsMap)
val brandlookupBD = sc.broadcast(brandlookup.collectAsMap)

def description(list:Array[String]): Array[String] = list.map(x => {
  val module =  modulelookupBD.value.getOrElse(x.take(4), "")
  val brand  = brandlookupBD.value.getOrElse(x.drop(4), "")
  s"$module $brand"
})

val printRDD = outputRDD.map{case (xs, y) => (description(xs), y)}
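To see what the mapping computes for each element, here is a minimal plain-Scala sketch that runs without Spark. The object name BroadcastLookupSketch, the ordinary Maps standing in for modulelookupBD.value and brandlookupBD.value, and all sample keys and values are made up for illustration; the lookup logic itself mirrors the description function above.

```scala
object BroadcastLookupSketch {
  // Hypothetical sample data standing in for modulelookupBD.value and brandlookupBD.value.
  val moduleMap: Map[String, String] = Map("AAAA" -> "ModuleA", "BBBB" -> "ModuleB")
  val brandMap: Map[String, String] = Map("1111" -> "BrandX", "2222" -> "BrandY")

  // The first 4 characters select the module, the remainder selects the brand.
  // Missing keys fall back to the empty string, as with getOrElse above.
  def description(list: Array[String]): Array[String] = list.map { x =>
    val module = moduleMap.getOrElse(x.take(4), "")
    val brand = brandMap.getOrElse(x.drop(4), "")
    s"$module $brand"
  }

  def main(args: Array[String]): Unit = {
    // "BBBB9999" has a known module but an unknown brand.
    println(description(Array("AAAA1111", "BBBB9999")).mkString(", "))
  }
}
```

Note that a key missing from either map leaves an empty slot in the output string instead of dropping the element, so every input array keeps its length and order.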

If not, there is no efficient way of handling this. You can try a combination of flatMap, join and groupByKey, but for any large dataset it can be prohibitively expensive.

val indexed = outputRDD.zipWithUniqueId
val flattened = indexed.flatMap{case ((xs, _), id) => xs.map(x => (x, id))}

val withModuleAndBrand = flattened
  .map(xid => (xid._1.take(4), xid))
  .join(modulelookup)
  .values
  .map{case ((x, id), module) => (x.drop(4), (id, module))}
  .join(brandlookup)
  .values
  .map{case ((id, module), brand) => (id, s"$module $brand")}
  .groupByKey

// `final` is a reserved word in Scala, so the result needs a different name
val result = withModuleAndBrand.join(
  indexed.map{case ((_, y), id) => (id, y)}
).values

Replacing RDDs with DataFrames can cut down on boilerplate code, but performance will remain a problem.
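As a rough illustration of the DataFrame variant, the sketch below assumes Spark SQL is on the classpath and uses made-up sample data. The object name and the column names (codes, y, moduleKey, module, brandKey, brand) are hypothetical and do not come from the question. It follows the same explode, join, and regroup shape as the RDD version, so the shuffle cost is still there.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat_ws, explode, expr, monotonically_increasing_id, substring}

object DataFrameLookupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-lookup-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for outputRDD, modulelookup and brandlookup.
    val output = Seq(
      (Seq("AAAA1111", "BBBB2222"), 10),
      (Seq("AAAA2222"), 20)
    ).toDF("codes", "y")
    val modules = Seq(("AAAA", "ModuleA"), ("BBBB", "ModuleB")).toDF("moduleKey", "module")
    val brands = Seq(("1111", "BrandX"), ("2222", "BrandY")).toDF("brandKey", "brand")

    val described = output
      .withColumn("id", monotonically_increasing_id())      // row id used to regroup later
      .withColumn("code", explode($"codes"))                // one row per array element
      .withColumn("moduleKey", substring($"code", 1, 4))    // same as take(4)
      .withColumn("brandKey", expr("substring(code, 5)"))   // same as drop(4)
      .join(modules, Seq("moduleKey"))                      // inner join: unmatched codes are dropped
      .join(brands, Seq("brandKey"))
      .groupBy($"id", $"y")
      .agg(collect_list(concat_ws(" ", $"module", $"brand")).as("descriptions"))

    described.show(truncate = false)
    spark.stop()
  }
}
```

Two caveats apply to this sketch: the inner joins drop any element whose key is missing from a lookup table, and collect_list does not guarantee that descriptions come back in the original array order.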

Upvotes: 2
