GameOfThrows

Reputation: 4510

scala regex in a paired RDD

I have a question regarding regex in RDD operations in Scala/Eclipse/Spark.

I have two data files which I have parsed and joined to form an RDD of pairs [URL, RegexOfURL]. They look something like this:

(http://coach.nationalexpress.com/nxbooking/journey-list,
(^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details))

I wish to run an operation such that each URL (the first part) is matched against the regex (the second part). If the regex matches, flag the pair as true; otherwise flag it as false.

I have tried writing a function:

def operation(s1:RDD[String], s2:RDD[String]) = 
s1 match{
case s2 => 't'
case _ => 'f'
}

but the match does not do what I want. I want to apply the regex properly and am having trouble.

I also tried breaking the RDD into individual lines and running a function on each, with no success. What would you suggest is the best way to do this?

Thanks in advance

Upvotes: 1

Views: 403

Answers (1)

maasg

Reputation: 37435

Assuming the input is an RDD of (url, regex) pairs with the regex held as a String, i.e. RDD[(String, String)], the transformation should look something like this:

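val urlMatchRegexRdd = urlRegexPairsRDD.map { case (url, regex) =>
  // Compile the regex string; an expression like regex.r cannot be used directly as a case pattern
  val pattern = regex.r
  // findFirstIn matches anywhere in the URL (anchors such as ^ still apply);
  // a `case pattern(_*)` extractor would instead require the whole URL to match
  ((url, regex), pattern.findFirstIn(url).isDefined)
}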

This will result in an RDD of the form RDD[((String, String), Boolean)], preserving the original information alongside the regex match result.
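
To check the matching logic without a Spark cluster, here is a minimal sketch that applies the same per-pair function to a plain Seq. The object name UrlRegexFlag, the helper matches, and the second URL are assumptions made up for illustration; only the first URL and the regex come from the question. It assumes that a prefix match is what is wanted, which is why it uses findFirstIn rather than a whole-string match.

```scala
object UrlRegexFlag {
  // Hypothetical helper: true when the regex (given as a String)
  // matches somewhere in the URL. Anchors like ^ are still honoured.
  def matches(url: String, regex: String): Boolean =
    regex.r.findFirstIn(url).isDefined

  def main(args: Array[String]): Unit = {
    // Regex taken from the question; triple quotes keep the backslashes literal.
    val regex =
      """^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details"""

    val pairs = Seq(
      // URL from the question (http, so it does not match the https patterns)
      ("http://coach.nationalexpress.com/nxbooking/journey-list", regex),
      // Made-up URL that matches the second alternative
      ("https://coach.nationalexpress.com/nxbooking/delivery-details", regex)
    )

    // Same shape as the RDD transformation: ((url, regex), flag)
    val flagged = pairs.map { case (url, re) => ((url, re), matches(url, re)) }

    flagged.foreach { case ((url, _), flag) => println(s"$url -> $flag") }
  }
}
```

With an RDD[(String, String)] the same map body should apply unchanged, since the function only looks at one pair at a time.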

Upvotes: 1
