Reputation: 11
Basically, I have two sets of data in two text files. One set of data is in the format:
a,DataString1,DataString2 (one line; the first character appears in every entry but is not relevant)
.... (and so on)
The second set of data is in format:
Data, Data Data Data, Data Data, Data, Data Data Data (one line; fields are separated by either commas or spaces, but I can handle that with a regular expression, so that's not the problem)
.... (And so on)
So what I need to do is check if DataString1 AND DataString2 are both present on any single line of the second set of data.
Currently I'm doing this like so:
// spark context is defined above, imported java.util.regex.Pattern above as well
case class test(data_one: String, data_two: String)
// the case class just gives the two fields from each line of data_one convenient names
val data_one = sc.textFile("path")
val data_two = sc.textFile("path")
val rdd_one = data_one.map(_.split(",")).map(c => test(c(1), c(2)))
// note: "[,\\s]+" splits on runs of commas/whitespace; the original "[,\\s*]"
// would also split on a literal '*' and produce empty tokens between delimiters
val rdd_two = data_two.map(_.split("[,\\s]+"))
val data_two_array = rdd_two.collect()
// this causes data_two_array to be an array of array of strings.
rdd_one.foreach { line =>
for (array <- data_two_array) {
for (string <- array) {
// comparison logic that checks whether both data_one and data_two of an entry
// appear on the same line goes in these two for loops
}
}
}
How could I make this process more efficient? At the moment it works correctly, but as the data sizes grow it becomes very inefficient.
Upvotes: 1
Views: 119
Reputation: 7148
The double for loop compares every element of one set against every element of the other, so it does on the order of m*n work, where m and n are the sizes of the two sets. You can start with a join to eliminate non-matching rows early. Since you have two columns to verify, make sure the join accounts for both of them.
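One way to sketch the join-based idea with the RDD API (reusing the `sc` and `test` case class from the question; the paths and the `tokensByLine` name are placeholders, and this assumes it is enough to know *which* line matched, identified by its index):

```scala
// Assumes the same SparkContext `sc` and case class test(data_one, data_two)
// as in the question; "path_one"/"path_two" are placeholder paths.
val rddOne = sc.textFile("path_one")
  .map(_.split(","))
  .map(c => test(c(1), c(2)))

// Tag every line of the second file with an id and explode it into
// (token, lineId) pairs, so each token can act as a join key.
val tokensByLine = sc.textFile("path_two")
  .zipWithIndex()
  .flatMap { case (line, id) => line.split("[,\\s]+").map(tok => (tok, id)) }
  .distinct()

// First join finds lines containing data_one; re-keying on (data_two, lineId)
// and joining again keeps a pair only if BOTH strings occur on the same line.
val matches = rddOne.map(t => (t.data_one, t.data_two))
  .join(tokensByLine)                                      // (s1, (s2, lineId))
  .map { case (s1, (s2, id)) => ((s2, id), s1) }
  .join(tokensByLine.map { case (tok, id) => ((tok, id), ()) })
  .map { case ((s2, id), (s1, _)) => (s1, s2, id) }
```

This avoids collecting the second data set to the driver and replaces the all-pairs scan with shuffled joins, which Spark can distribute across the cluster.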
Upvotes: 1