Scala Spark sort RDD by index of substring

Question

I'm new to Scala Spark and I have a question.

I have an RDD that contains 120 million strings, and I'm trying to find all strings that contain a given substring. This part works fine.

Now I want to sort the output by the index of the match, so that strings in which the substring appears closer to the start come first.

For example:

The sub-string: abcdefg

The strings:

s1 = tryuabcdefgyui

s2 = trabcdefgyui

s3 = abcdefgyuo

So my desired output should be a list/RDD sorted as {s3, s2, s1}.

What is the best way of doing so?

marios · Accepted Answer

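The idea is to transform the RDD[String] into an RDD[(String, Int)], where the Int is the position of the substring, computed with Java's String.indexOf. Since indexOf returns -1 when the substring is absent and 0 when the string starts with it, keep every pair whose index is >= 0 and then sort by that index.

// Dataset
val r = sc.makeRDD(Seq("abf", "ffff", "aaaaaabf", "ttggabf"))

// Sort on the index of substring "bf", keeping only strings that contain "bf".
// The filter uses >= 0 (not > 0) so that a match at the very start is not dropped.
val sorted = r.map(s => (s, s.indexOf("bf"))).filter(_._2 >= 0).sortBy(_._2)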
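
To check the logic against the strings from the question, here is a minimal sketch of the same map / filter / sortBy chain on a plain Scala Seq, so it runs without a Spark context. The helper name sortByFirstMatch and the extra "nomatch" string are made up for illustration. Note that s3 matches at index 0, so the filter has to keep index 0 and reject only -1.

```scala
// Hypothetical helper (not part of Spark): keeps the strings that contain `sub`
// and orders them by the position of the first match.
def sortByFirstMatch(strings: Seq[String], sub: String): Seq[String] =
  strings
    .map(s => (s, s.indexOf(sub))) // indexOf returns -1 when `sub` is absent
    .filter(_._2 >= 0)             // keep matches, including a match at index 0
    .sortBy(_._2)                  // smallest index (closest to the start) first
    .map(_._1)                     // drop the index, keep only the string

val result = sortByFirstMatch(
  Seq("tryuabcdefgyui", "trabcdefgyui", "abcdefgyuo", "nomatch"),
  "abcdefg"
)
// result: Seq("abcdefgyuo", "trabcdefgyui", "tryuabcdefgyui"), i.e. {s3, s2, s1}
```

RDD exposes map, filter and sortBy with the same shape, so the chain should carry over to an RDD[String] unchanged. With 120 million strings, avoid collecting the full result to the driver; something like take(n) on the sorted RDD is usually safer.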
