wn125

Reputation: 11

(Spark/Scala) What would be the most effective way to compare specific data in one RDD to a line of another?

Basically, I have two sets of data in two text files. One set of data is in the format:

a,DataString1,DataString2 (one line; the leading character appears in every entry but is not relevant)

.... (and so on)

The second set of data is in format:

Data, Data Data Data, Data Data, Data, Data Data Data (one line; values are separated by either commas or spaces, but I can handle that with a regular expression, so that's not the problem)

.... (And so on)

So what I need to do is check if DataString1 AND DataString2 are both present on any single line of the second set of data.

Currently I'm doing this like so:

// spark context is defined above; imported java.util.regex.Pattern above as well
case class Test(dataOne: String, dataTwo: String)
// the case class just keeps the two strings we care about together
val data_one = sc.textFile("path")
val data_two = sc.textFile("path")

val rdd_one = data_one.map(_.split(",")).map(c => Test(c(1), c(2)))
val rdd_two = data_two.map(_.split("[,\\s]+"))
val data_two_array = rdd_two.collect()
// data_two_array is an Array[Array[String]]
rdd_one.foreach { line =>
    for (array <- data_two_array) {

        for (string <- array) {
            // comparison logic here that checks whether both dataString1
            // and dataString2 happen to be on the same line
        }
    }
}

How could I make this process more efficient? It works correctly at the moment, but as the data sizes grow it becomes very inefficient.

Upvotes: 1

Views: 119

Answers (1)

Ramzy

Reputation: 7148

The double for loop scans all elements, doing m*n comparisons where m and n are the sizes of the two sets. You can start with a join to eliminate rows. Since you have two columns to verify, make sure the join takes care of both of them.
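
For illustration, a minimal sketch of one way that join could look. The zipWithIndex/double-join shape, the Test case class, and the file paths are assumptions for the sketch, not something the answer specifies; sc is the SparkContext from the question.

case class Test(dataOne: String, dataTwo: String)

// "path1"/"path2" are placeholders for the two input files.
val pairs = sc.textFile("path1")
  .map(_.split(","))
  .map(c => Test(c(1), c(2)))

// Number each line of the second set, then explode it into
// (token, lineId) pairs so individual tokens become join keys.
val tokensByLine = sc.textFile("path2")
  .zipWithIndex()
  .flatMap { case (line, id) =>
    line.split("[,\\s]+").map(token => (token, id))
  }

// Join once per column, re-keying by (pair, lineId) after each join.
val hitsOnOne = pairs.map(t => (t.dataOne, t)).join(tokensByLine)
  .map { case (_, (t, id)) => ((t, id), ()) }
val hitsOnTwo = pairs.map(t => (t.dataTwo, t)).join(tokensByLine)
  .map { case (_, (t, id)) => ((t, id), ()) }

// A (pair, lineId) key surviving both joins means both strings
// appear on that line of the second set.
val matches = hitsOnOne.join(hitsOnTwo).keys.distinct()

Each join shuffles and matches by key instead of scanning every (pair, line) combination, so the m*n driver-side loop is replaced by partitioned lookups; the final distinct handles tokens that repeat within a line.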

Upvotes: 1
