Reputation: 391
I have an RDD[(Long, String)]. A sample record looks like this:
(123, name:abc,sr.no:1,name:def,sr.no:2)
I want to transform this RDD so that each record holds a list of the sr.no values. The output should look like this:
(123, [1,2])
I tried this in Scala with the flatMap approach, but I want only one record for "123", with all the values in a single array.
Upvotes: 0
Views: 2068
Reputation: 214927
You can use a regex with look-behind syntax (?<=) to extract the digits after sr.no:
val p = "(?<=sr\\.no:)\\d+".r
// p: scala.util.matching.Regex = (?<=sr\.no:)\d+
rdd.map{case (x, y) => (x, p.findAllIn(y).toList)}.collect()
// res10: Array[(Int, List[String])] = Array((123,List(1, 2)))
Or, as @Tim commented, use mapValues():
rdd.mapValues(p.findAllIn(_).toList).collect()
// res11: Array[(Int, List[String])] = Array((123,List(1, 2)))
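The output above holds the numbers as strings, while the question asks for a list like [1,2]. A minimal sketch of the same look-behind approach that also converts each match to an Int, run on a plain Scala Seq as a local stand-in for the RDD. The names SrNoExtract and srNos are made up for illustration, and the sketch assumes every sr.no value fits in an Int:

```scala
object SrNoExtract {
  // Look-behind for the literal prefix "sr.no:" (dot escaped), then match the digits.
  private val SrNo = "(?<=sr\\.no:)\\d+".r

  // Returns every sr.no value in the record as an Int, in order of appearance.
  // Assumption: values are small enough for Int; toInt throws otherwise.
  def srNos(record: String): List[Int] =
    SrNo.findAllIn(record).map(_.toInt).toList

  def main(args: Array[String]): Unit = {
    // Local stand-in for the RDD[(Long, String)] from the question.
    val records = Seq(123L -> "name:abc,sr.no:1,name:def,sr.no:2")
    val result = records.map { case (id, text) => (id, srNos(text)) }
    println(result)
  }
}
```

On a real pair RDD the same function should slot into rdd.mapValues(SrNoExtract.srNos), since mapValues only needs a function from the value type to the new value type.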
Upvotes: 0
Reputation: 3725
You'll maintain the number of records if you use mapValues. Here is a naive function that does what you want:
scala> def foo(s: String, pattern: String): Array[String] = s.split(",").filter(_.contains(pattern)).map(_.split(":").last)
foo: (s: String, pattern: String)Array[String]
scala> foo("name:abc,sr.no:1,name:def,sr.no:2", "sr.no")
res3: Array[String] = Array(1, 2)
Now you can call:
rdd.mapValues(foo(_, "sr.no"))
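To see why this keeps one record per key while the flatMap attempt from the question does not, here is a rough sketch on a plain Scala Seq standing in for the RDD. RecordCount and extract are hypothetical names; extract restates the splitting logic of foo so the snippet is self-contained, and the map over the value is only a local approximation of what mapValues does on a pair RDD:

```scala
object RecordCount {
  // Same splitting logic as foo above, restated so the sketch is self-contained.
  // Note that key is matched as a literal substring, not as a regex.
  def extract(s: String, key: String): List[String] =
    s.split(",").filter(_.contains(key)).map(_.split(":").last).toList

  // Local stand-in for the RDD[(Long, String)] from the question.
  val records: Seq[(Long, String)] =
    Seq(123L -> "name:abc,sr.no:1,name:def,sr.no:2")

  // flatMap emits one record per extracted value, so the key 123 appears twice.
  val flattened: Seq[(Long, String)] =
    records.flatMap { case (id, text) => extract(text, "sr.no").map(v => (id, v)) }

  // Transforming only the value keeps exactly one record per input record.
  val grouped: Seq[(Long, List[String])] =
    records.map { case (id, text) => (id, extract(text, "sr.no")) }
}
```

Here flattened holds two records for key 123 and grouped holds one, which matches the single-record output the question asks for.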
Upvotes: 2