newbie

Reputation: 391

Splitting a string in a Spark RDD

I have an RDD[(Long, String)]. A sample record looks like this:

(123, name:abc,sr.no:1,name:def,sr.no:2)

I want to transform this RDD so that each record holds a list of its sr.no values. The output should look like this:

(123, [1,2])

I tried this in Scala with the flatMap approach, but I want only one record for "123", with all the values in an array.

Upvotes: 0

Views: 2068

Answers (2)

akuiper

Reputation: 214927

You can use regex to extract the digits after sr.no: with look-behind syntax (?<=):

// The dot is escaped so that it matches a literal "." rather than any character.
val p = "(?<=sr\\.no:)\\d+".r
// p: scala.util.matching.Regex = (?<=sr\.no:)\d+

rdd.map{case (x, y) => (x, p.findAllIn(y).toList)}.collect()
// res10: Array[(Int, List[String])] = Array((123,List(1, 2)))

Or as @Tim commented, use mapValues():

rdd.mapValues(p.findAllIn(_).toList).collect()
// res11: Array[(Int, List[String])] = Array((123,List(1, 2)))
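
If the values should be numbers, as in the question's expected output (123, [1,2]), the matched strings can be converted with toInt. The following is a minimal sketch on a plain string, without a Spark context. The names SrNoRegex and extractSrNos are hypothetical, and it assumes every value fits in an Int:

```scala
import scala.util.matching.Regex

object SrNoRegex {
  // Look-behind for the literal text "sr.no:", followed by one or more digits.
  val srNoPattern: Regex = "(?<=sr\\.no:)\\d+".r

  // Hypothetical helper: returns every sr.no value found in the string as an Int.
  def extractSrNos(s: String): List[Int] =
    srNoPattern.findAllIn(s).map(_.toInt).toList
}
```

With an RDD this could presumably be applied as rdd.mapValues(SrNoRegex.extractSrNos), which keeps one record per key.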

Upvotes: 0

Tim

Reputation: 3725

You'll maintain the number of records if you use mapValues. Here is a naive function that does what you want:

scala> def foo(s: String, pattern: String): Array[String] = s.split(",").filter(_.contains(pattern)).map(_.split(":").last)
foo: (s: String, pattern: String)Array[String]

scala> foo("name:abc,sr.no:1,name:def,sr.no:2", "sr.no")
res3: Array[String] = Array(1, 2)

Now you can call:

rdd.mapValues(foo(_, "sr.no"))
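
The difference from the flatMap attempt in the question can be seen on a local collection standing in for the RDD. This is only a sketch: MapValuesSketch, records, perRecord and flattened are illustrative names, and a plain Seq of pairs is used instead of Spark so that it runs without a cluster. Transforming only the value keeps one output pair per key, while flatMap emits one pair per extracted value:

```scala
object MapValuesSketch {
  // Same naive parser as in the answer above.
  def foo(s: String, pattern: String): Array[String] =
    s.split(",").filter(_.contains(pattern)).map(_.split(":").last)

  // Local stand-in (an assumption) for the RDD[(Long, String)] from the question.
  val records: Seq[(Long, String)] =
    Seq((123L, "name:abc,sr.no:1,name:def,sr.no:2"))

  // Analogous to rdd.mapValues(...): one output pair per input pair.
  val perRecord: Seq[(Long, List[String])] =
    records.map { case (id, s) => (id, foo(s, "sr.no").toList) }

  // flatMap instead yields one pair per extracted value.
  val flattened: Seq[(Long, String)] =
    records.flatMap { case (id, s) => foo(s, "sr.no").toList.map(v => (id, v)) }
}
```

Here perRecord holds a single pair for key 123 with both values, whereas flattened holds two pairs with the same key.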

Upvotes: 2
