Joyfulvillage

Reputation: 557

Scala XML parsing with real-world HTML (with unmatched tag)

My application needs to embed an HTML document inside an XML document.

val xml = 
  <document>
    <id> { getId } </id>
    <content> 
      { getContent }
    </content>
  </document>

getId is a simple function that returns a new sequence number. The issue is in getContent:

import scala.xml.XML

def getContent = {
  // Wrap the fragment in a single root element so it can be parsed as XML
  val wrapped = "<wrap>" + article.content + "</wrap>"
  XML.loadString(wrapped).child
}

As you can see, article.content returns a String holding the real-world HTML document. scala.xml.XML.loadString parses it into XML, and the resulting child nodes are embedded into the xml val correctly.

However, this only works when the HTML is well-formed, e.g. <body>Hello world</body>

Some articles contain markup such as <body><strong>Hello world</body>, which lacks the closing tag for the <strong> element. (And no, I can't just blame the user!)

In this case, parsing throws an exception and the application stops.

Is there any way to either bypass the validation or simply embed the HTML as a string within the XML document, without parsing it?

Please shed some light on this situation. Any suggestions are welcome.

Upvotes: 0

Views: 320

Answers (1)

Kevin Wright

Reputation: 49705

Both JSoup and TagSoup (amongst others) are suitable for parsing HTML that isn't also well-formed XML.
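As a rough illustration of the JSoup route, here is a minimal sketch. It assumes jsoup and scala-xml are on the classpath; the names TidyHtml and htmlToNodes are made up for this example and are not part of either library. The idea is to let jsoup repair the tag soup, serialize the result with XML syntax, and only then hand it to XML.loadString.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Entities}
import scala.xml.{Node, XML}

object TidyHtml {
  // Hypothetical helper: repairs malformed HTML with jsoup, then parses
  // the cleaned-up markup with scala.xml.
  def htmlToNodes(html: String): List[Node] = {
    // jsoup closes unmatched tags such as a dangling <strong>
    val doc = Jsoup.parseBodyFragment(html)
    doc.outputSettings()
      .syntax(Document.OutputSettings.Syntax.xml) // emit XML-style markup
      .escapeMode(Entities.EscapeMode.xhtml)      // avoid HTML-only named entities
      .prettyPrint(false)                         // do not inject extra whitespace
    // Wrap in a single root element, as in the question's getContent
    val wrapped = "<wrap>" + doc.body().html() + "</wrap>"
    XML.loadString(wrapped).child.toList
  }
}
```

With something like this, getContent could become TidyHtml.htmlToNodes(article.content). Note that jsoup may restructure the markup while repairing it, so the output is not guaranteed to match the input byte for byte.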

You'll have to decide which is best for your own use-case.
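If the HTML does not need to be parsed at all, the other option raised in the question (embedding it as a string) can be sketched with scala-xml alone. This is only an illustration under assumptions: EmbedRaw and document are hypothetical names, and the id is passed in as a plain Int rather than coming from getId.

```scala
import scala.xml.{Elem, PCData}

object EmbedRaw {
  // Hypothetical helper: stores the HTML verbatim inside a CDATA section,
  // so it is never parsed and unmatched tags cannot cause an exception.
  def document(id: Int, html: String): Elem =
    <document>
      <id>{ id }</id>
      <content>{ new PCData(html) }</content>
    </document>
}
```

Splicing the String directly, as in <content>{ html }</content>, should also work: scala-xml then treats it as text and escapes the angle brackets on output. Either way, consumers of the XML receive the HTML as character data, not as child elements. HTML that itself contains the sequence ]]> may need extra care with the CDATA variant, depending on the scala-xml version.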

Upvotes: 2
