Reputation: 71
I'm a beginner with Scala and RDDs. I'm using Scala on Spark 2.4. I have an RDD[String] with lines like this:
(a, b, c, d, ...)
I would like to split this String at each comma to get an RDD[(String, String, String, ...)].
Solutions like the following are obviously impractical given the number of elements:
rdd.map(x => (x.split(",")(0), x.split(",")(1), x.split(",")(2)))
Maybe there is a way to automate this? Anything that works would be fine.
Despite my efforts, I have no solution to my issue so far.
Thanks a lot!
Upvotes: 0
Views: 1608
Reputation: 74
Note that the maximum tuple size in Scala is 22, so listing all the elements won't take that long ...
By the way, the book Spark in Action says on page 110:
There's no elegant way to convert an array to a tuple, so you have to resort to this ugly expression:
scala> val itPostsRDD = itPostsSplit.map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8), x(9), x(10), x(11), x(12)))
itPostsRDD: org.apache.spark.rdd.RDD[(String, String, ...
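A small sketch of the same idea outside Spark: split the line once and index into the resulting array, rather than calling split for every field as in the question. The column count of 3 is an assumption here for brevity.

```scala
// Split once, then index into the array — avoids re-splitting
// the line for each field. 3 columns assumed for illustration.
val fields = "a,b,c".split(",")
val row = (fields(0), fields(1), fields(2))
// row: (String, String, String)
```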
Upvotes: 2
Reputation: 22595
One solution is to just write the mapping function:
def parse(s: String) = s.split(",") match {
  case Array(a, b, c) => (a, b, c)
}
parse("x,x,x") // (x,x,x)
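Note that parse throws a MatchError on lines that don't have exactly three fields. A hypothetical safer variant (the name parseOpt is my own) returns an Option instead:

```scala
// Hypothetical safe variant: returns None instead of throwing
// a MatchError when the line doesn't have exactly three fields.
def parseOpt(s: String): Option[(String, String, String)] =
  s.split(",") match {
    case Array(a, b, c) => Some((a, b, c))
    case _              => None
  }

parseOpt("x,y,z") // Some((x,y,z))
parseOpt("x,y")   // None
```

On an RDD you could then use rdd.flatMap(parseOpt) to drop malformed lines silently.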
You could write a more generic solution using shapeless:
import shapeless._
import shapeless.ops.hlist.Tupler
import shapeless.ops.traversable.FromTraversable
import shapeless.syntax.std.traversable._

def toTuple[H <: HList](s: String)(implicit ft: FromTraversable[H], t: Tupler[H]) =
  s.split(",").toHList[H].get.tupled
then you can use it directly:
toTuple[String :: String :: String :: HNil]("x,x,x") // (x,x,x)
toTuple[String :: String :: HNil]("x,x") // (x,x)
or fix the type and then use it:
def parse3(s: String) = toTuple[String :: String :: String :: HNil](s)
parse3("x,x,x") // (x,x,x)
Upvotes: 3
Reputation: 22840
If the number of elements is fixed, you can do something like:
val tuples =
rdd
.map(line => line.replaceAll("[\\(\\)]", "").split(","))
.collect {
case Array(col1, col2, ..., coln) => (col1, col2, ..., coln)
}
// tuples: RDD[(String, String, ..., String)]
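For a concrete, runnable instance (a fixed column count of 3 is an assumption), the same pattern works on plain Scala collections, since they offer the same partial-function collect as RDDs:

```scala
// Local sketch with 3 columns (an assumption); Scala collections
// provide the same partial-function `collect` as RDDs, so the
// logic carries over to rdd.map(...).collect { ... } unchanged.
val lines = List("(a,b,c)", "(d,e,f)")
val tuples = lines
  .map(_.replaceAll("[\\(\\)]", "").split(","))
  .collect { case Array(c1, c2, c3) => (c1, c2, c3) }
// tuples: List[(String, String, String)]
```

Rows whose field count doesn't match the pattern are simply dropped by collect rather than causing an error.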
Upvotes: 3