Reputation: 4510
I have a question regarding regex in RDD operations in Scala/Eclipse/Spark.
I have two data files which I have parsed and joined to form an RDD of [URL RegexOfURL] pairs. They look something like this:
(http://coach.nationalexpress.com/nxbooking/journey-list,
(^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details))
I want to run an operation that matches each URL (the first part) against the regex (the second part). If the regex matches, flag the pair as true; otherwise flag it as false.
I have tried writing a function:
def operation(s1:RDD[String], s2:RDD[String]) =
s1 match{
case s2 => 't'
case _ => 'f'
}
but this match does not do what I want. I want to apply the regex properly and am having trouble doing so.
I also tried breaking the RDD into individual lines and running a function on each, with no success. What would you suggest is the best way to do this?
Thanks in advance
Upvotes: 1
Views: 403
Reputation: 37435
Given that the input data is an RDD of (string, regex) pairs, where the regex is in String form, i.e. RDD[(String,String)], the transformation should look something like this:
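val urlMatchRegexRdd = urlRegexPairsRDD.map { case (url, regex) =>
  // Compile the regex string. A Scala Regex extractor must match the whole
  // string by default; .unanchored lets it match just part of the URL.
  val pattern = regex.r.unanchored
  url match {
    case pattern(_*) => ((url, regex), true)
    case _           => ((url, regex), false)
  }
}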
This will result in an RDD of the form RDD[((String, String),Boolean)], preserving the original information with the added regex match result.
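To try the matching logic without a Spark cluster, here is a minimal sketch on a plain Scala Seq. The names UrlRegexDemo, matchesUrl, flagPairs and bookingRegex are hypothetical, and the second URL is made up for illustration; the first URL and the regex come from the question. It assumes a match anywhere in the URL is wanted, so it uses findFirstIn rather than a pattern extractor, which by default only succeeds when the regex matches the whole string. The ^ anchors in the regex still pin each alternative to the start of the URL.

```scala
object UrlRegexDemo {
  // Hypothetical helper: true if the regex finds a match anywhere in the URL.
  // Explicit ^ anchors inside the regex still tie the match to the start.
  def matchesUrl(url: String, regex: String): Boolean =
    regex.r.findFirstIn(url).isDefined

  // Flags every (url, regex) pair while keeping the original pair.
  // The same function literal can be passed to map on an RDD[(String, String)].
  def flagPairs(pairs: Seq[(String, String)]): Seq[((String, String), Boolean)] =
    pairs.map { case (url, regex) => ((url, regex), matchesUrl(url, regex)) }

  // Regex taken from the question (two alternatives, each anchored with ^).
  val bookingRegex: String =
    """^https://www\.nationalexpress\.com/bps/confirmation\.cfm\?id=|^https://coach\.nationalexpress\.com/nxbooking/delivery-details"""

  def main(args: Array[String]): Unit = {
    val pairs = Seq(
      // URL from the question: http and journey-list, so neither alternative matches
      ("http://coach.nationalexpress.com/nxbooking/journey-list", bookingRegex),
      // Made-up URL that starts with the second alternative
      ("https://coach.nationalexpress.com/nxbooking/delivery-details?step=2", bookingRegex)
    )
    // Prints the flag followed by the URL: false for the first pair, true for the second
    flagPairs(pairs).foreach { case ((url, _), matched) => println(s"$matched  $url") }
  }
}
```

If recompiling the regex for every record turns out to be costly, one option is to compile each distinct regex once and reuse it, but that depends on how many distinct patterns the data contains.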
Upvotes: 1