wn125

Reputation: 11

(Spark/Scala) What would be the most effective way to compare specific data in one RDD to a line of another?

Basically, I have two sets of data in two text files. One set of data is in the format:

a,DataString1,DataString2 (one line; the leading character appears in every entry but is not relevant)

.... (and so on)

The second set of data is in format:

Data, Data Data Data, Data Data, Data, Data Data Data (one line; values are separated by either commas or spaces, but I can handle that with a regular expression, so that's not the problem)

.... (And so on)

So what I need to do is check if DataString1 AND DataString2 are both present on any single line of the second set of data.

Currently I'm doing this like so:

// spark context is defined above; imported java.util.regex.Pattern above as well
case class Test(dataOne: String, dataTwo: String)
// the case class just keeps the two strings we care about together
val data_one = sc.textFile("path")
val data_two = sc.textFile("path")

val rdd_one = data_one.map(_.split(",")).map(c => Test(c(1), c(2)))
val rdd_two = data_two.map(_.split("[,\\s]+"))
val data_two_array = rdd_two.collect()
// data_two_array is an Array[Array[String]]
rdd_one.foreach { line =>
    for (array <- data_two_array) {

        for (string <- array) {
            // comparison logic here that checks whether both dataString1
            // and dataString2 happen to be on the same line
        }
    }
}

How could I make this process more efficient? It works correctly at the moment, but as the data sizes grow it becomes very inefficient.

Upvotes: 1

Views: 119

Answers (1)

Ramzy

Reputation: 7148

The double for loop scans all elements, doing m*n comparisons where m and n are the sizes of the two sets. You can start with a join to eliminate rows. Since you have two columns to verify, make sure the join takes care of both of them.
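
For illustration, a minimal sketch of one way that join could look. The zipWithIndex/double-join shape, the Test case class, and the file paths are assumptions for the sketch, not something the answer specifies; sc is the SparkContext from the question.

case class Test(dataOne: String, dataTwo: String)

// "path1"/"path2" are placeholders for the two input files.
val pairs = sc.textFile("path1")
  .map(_.split(","))
  .map(c => Test(c(1), c(2)))

// Number each line of the second set, then explode it into
// (token, lineId) pairs so individual tokens become join keys.
val tokensByLine = sc.textFile("path2")
  .zipWithIndex()
  .flatMap { case (line, id) =>
    line.split("[,\\s]+").map(token => (token, id))
  }

// Join once per column, re-keying by (pair, lineId) after each join.
val hitsOnOne = pairs.map(t => (t.dataOne, t)).join(tokensByLine)
  .map { case (_, (t, id)) => ((t, id), ()) }
val hitsOnTwo = pairs.map(t => (t.dataTwo, t)).join(tokensByLine)
  .map { case (_, (t, id)) => ((t, id), ()) }

// A (pair, lineId) key surviving both joins means both strings
// appear on that line of the second set.
val matches = hitsOnOne.join(hitsOnTwo).keys.distinct()

Each join shuffles and matches by key instead of scanning every (pair, line) combination, so the m*n driver-side loop is replaced by partitioned lookups; the final distinct handles tokens that repeat within a line.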

Upvotes: 1
