Reputation: 1510
I'm new to Scala and Spark, and I have a question.
I have an RDD that contains 120 million strings, and I'm trying to find all strings that contain a given sub-string. That part works fine.
Now I want to sort the output by the index of the match, so that strings in which the sub-string occurs closer to the start come first.
For example:
The sub-string: abcdefg
The strings:
s1 = tryuabcdefgyui
s2 = trabcdefgyui
s3 = abcdefgyuo
So my desired output should be a list/RDD sorted as {s3, s2, s1}.
What is the best way of doing so?
Upvotes: 3
Views: 1049
Reputation: 8996
The idea is to transform the RDD[String] to an RDD[(String, Index)], where the Index is calculated using Java's String indexOf.
// Dataset
val r = sc.makeRDD(Seq("abf", "ffff", "aaaaaabf", "ttggabf"))
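// Sort on the index of the substring "bf", keeping only the strings that contain "bf".
// indexOf returns -1 when there is no match and 0 when the string starts with the substring, so the filter must be >= 0.
val sorted = r.map(s => (s, s.indexOf("bf"))).filter(_._2 >= 0).sortBy(_._2)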
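Applied to the strings from the question, the same approach could look roughly like the sketch below. The object name `SubstringSort`, the helper `sortBySubstringIndex`, the local master, and the app name are all assumptions made for a standalone run, not part of the original setup. It filters on `>= 0` so that a string starting with the sub-string (index 0) is kept, and it drops the index again at the end.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SubstringSort {
  // Hypothetical helper: keep strings containing `sub`, ordered by match position.
  def sortBySubstringIndex(strings: RDD[String], sub: String): RDD[String] =
    strings
      .map(s => (s, s.indexOf(sub)))          // pair each string with the match index (-1 if absent)
      .filter { case (_, idx) => idx >= 0 }   // index 0 is a valid match, so use >= 0
      .sortBy { case (_, idx) => idx }        // ascending: closest to the start first
      .map { case (s, _) => s }               // drop the index once sorted

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption for trying this out; on a cluster the master comes from spark-submit.
    val conf = new SparkConf().setAppName("substring-sort").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val strings = sc.makeRDD(Seq("tryuabcdefgyui", "trabcdefgyui", "abcdefgyuo", "nomatch"))
      sortBySubstringIndex(strings, "abcdefg").collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Strings that tie on the index come back in no particular order relative to each other; sorting on a tuple such as `(idx, s)` would be one way to make the result deterministic.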
Upvotes: 4