Reputation: 3812
I'm trying to extract URLs from a string. They aren't standardized: some are inside href attributes, others appear on their own.
I also need them grouped by type. For example, given the following strings:
val txt1: String = """Some text! <a href="http://www.google.com/test.mp3">MP3</a>"""
val txt2: String = """Some text! <a href="http://www.google.com/test.jpg">IMG</a>"""
val txt3: String = """Some more! <a href="http://www.google.com/">Link!</a>"""
These strings are all concatenated and contain three URLs in total. I'm looking for something along the lines of:
val result: List[(String, List[String])] = List(
"mp3" -> List("http://www.google.com/test.mp3"),
"img" -> List("http://www.google.com/test.jpg"),
"url" -> List("http://www.google.com/")
)
I've looked into regex but have only got as far as extracting hrefs without determining their types, and this doesn't retrieve URLs that appear on their own outside of tags:
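import scala.util.matching.Regex
val hrefRegex = new Regex("""\<a.*?href=\"(http:.*?)\".*?\>.*?\</a>""")
// findAllIn yields the whole matched <a>...</a> tags, not just the captured URL
val hrefs: List[String] = hrefRegex.findAllIn(txt1).toList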
Any help is much appreciated, thanks in advance :)
Upvotes: 3
Views: 1218
Reputation: 41646
Assuming val txt = txt1 + txt2 + txt3, you can wrap the text in a root element, parse the resulting string as XML, and use the standard XML library to extract the anchors. This requires the text to be well-formed XML.
// can do other cleanup if necessary here such as changing "link!"
def normalize(t: String) = t.toLowerCase()
val txtAsXML = xml.XML.loadString("<root>" + txt + "</root>")
val anchors = txtAsXML \\ "a"
// returns scala.xml.NodeSeq containing the <a> tags
Then you just need to post-process until the data is organized the way you want:
val tuples = anchors.map(a => normalize(a.text) -> a.attributes("href").toString)
// Seq[(String, String)] containing elements
// like "mp3" -> http://www.google.com/test.mp3
val byTypes = tuples.groupBy(_._1).mapValues(seq => seq.map(_._2))
// here grouped by types:
// Map(img -> List(http://www.google.com/test.jpg),
// link! -> List(http://www.google.com/),
// mp3 -> List(http://www.google.com/test.mp3))
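The XML approach above groups by the anchor text and only sees URLs inside <a> tags. If you also need URLs that appear on their own, and want the type derived from the file extension instead, a regex pass over the raw text is one option. The following is only a rough sketch under some assumptions: the names UrlsByType, urlRegex, typeByExtension, classify and urlsByType are made up for illustration, the URL pattern is deliberately simple (a URL ends at whitespace, a quote or an angle bracket), and the extension table is just an example you would extend yourself.

```scala
import scala.util.matching.Regex

object UrlsByType {
  // Assumption: a URL starts with http:// or https:// and runs until
  // the next whitespace, quote, or angle bracket.
  val urlRegex: Regex = """https?://[^\s"'<>]+""".r

  // Hypothetical extension-to-type table; extend as needed.
  val typeByExtension: Map[String, String] = Map(
    "mp3"  -> "mp3",
    "jpg"  -> "img",
    "jpeg" -> "img",
    "png"  -> "img",
    "gif"  -> "img"
  )

  // Classify a URL by the extension of its last path segment,
  // falling back to "url" when the extension is missing or unknown.
  def classify(url: String): String = {
    // Ignore any query string or fragment.
    val path = url.takeWhile(c => c != '?' && c != '#')
    val lastSegment = path.substring(path.lastIndexOf('/') + 1)
    val dot = lastSegment.lastIndexOf('.')
    val ext = if (dot >= 0) lastSegment.substring(dot + 1).toLowerCase else ""
    typeByExtension.getOrElse(ext, "url")
  }

  // Find every URL in the text (inside href attributes or bare),
  // drop duplicates, and group the rest by type.
  def urlsByType(text: String): Map[String, List[String]] =
    urlRegex.findAllIn(text).toList.distinct.groupBy(classify)
}
```

Calling UrlsByType.urlsByType(txt1 + txt2 + txt3) should give a map with the keys "mp3", "img" and "url", each holding one URL. Note that a bare URL followed directly by punctuation (for example a full stop at the end of a sentence) would keep that punctuation with this simple pattern, so you may want to trim it afterwards.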
Upvotes: 5