Reputation: 937
I've read a lot of questions on Stack Overflow about HTML parsing. I've learned that, when possible, we should avoid regex and use a parser instead. I know there are many HTML/XML parsers, but I don't know how to use them properly.
Consider this HTML, cleaned up with jTidy. I have a Document object that jTidy created from this code:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<!-- Header content -->
</head>
<body>
<div id="container">
<div id="id1"> ... </div>
<div id="id2"> ... </div>
<div id="mainContent">
<div id="section 1">
<div id="subSection">
<!-- Interested part -->
<tbody>
<tr class="success">
<td class="fileName"><span>File One</span></td>
</tr>
<tr class="fail">
<td class="fileName"><span>File Two</span></td>
</tr>
<tr class="success">
<td class="fileName"><span>File Three</span></td>
</tr>
</tbody>
</div>
</div>
</div>
</div>
</body>
Now I would like to map (in a Map :D) each file name to its class (success/fail). I can do it with DOM, but I would have to create a NodeList and then, for each Element, another NodeList (memory-hungry and tedious). There are alternatives such as SAX, Xerces, etc., but I don't know their advantages and disadvantages.
What is the simplest (and fastest) way to extract that information from the "jTidied" HTML above?
Upvotes: 1
Views: 2182
Reputation: 2273
First of all, you forgot to add the <table> tag.
You can parse your code very easily with Jsoup.
Here is an example:
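import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is only illustrative.
public class ParseFiles {
    public static void main(String[] args) throws IOException {
        // String html = " ...here goes your html code... ";
        // Document doc = Jsoup.parse(html);
        // Or from a file:
        File input = new File("com.htm");
        Document doc = Jsoup.parse(input, "UTF-8");
        Elements trs = doc.select("tr"); // select all "tr" elements in the document
        for (Element tr : trs) {
            // Get the class attribute of the tr element
            System.out.println("The file class is: " + tr.attr("class")
                    // Get the file name text held inside the td element
                    + " The filename is: " + tr.select("td").text());
        }
    }
}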
Upvotes: 1
Reputation: 9139
In my opinion, the best approach would be to use XSLT + XPath (as Greg suggested in a comment) to produce input for an unmarshaller.
So the entire flow looks like this: HTML -> [jTidy cleanup] -> XHTML -> [XSLT transformation] -> string data representation -> [JAXB unmarshaller] -> Java object(s).
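To make the XSLT step of that flow concrete, here is a minimal sketch using only the JDK's javax.xml.transform API. The class name, the stylesheet, and the <files>/<file> output format are my own assumptions for illustration, not something from the question. The sample input is a trimmed-down version of the table with the <table> tag added and the xmlns declaration left out; with the XHTML namespace present, the stylesheet would need a prefix bound to that namespace. The JAXB step is not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name; only the XSLT step of the flow is sketched here.
final class FileStatusXslt {

    // Illustrative stylesheet: emits one <file> element per table row,
    // carrying the file name and the row's class as attributes.
    static final String XSLT =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<files>"
        + "<xsl:for-each select='//tr'>"
        + "<file name='{normalize-space(td)}' status='{@class}'/>"
        + "</xsl:for-each>"
        + "</files>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to an XHTML string and returns the result as a string.
    static String transform(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input, modelled on the rows in the question.
        String xhtml =
              "<html><body><table><tbody>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File One</span></td></tr>"
            + "<tr class=\"fail\"><td class=\"fileName\"><span>File Two</span></td></tr>"
            + "</tbody></table></body></html>";
        System.out.println(transform(xhtml));
    }
}
```

If you already hold the org.w3c.dom.Document from jTidy, a javax.xml.transform.dom.DOMSource wrapping it can replace the second StreamSource.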
If you don't want to produce objects, use only XPath, as described in this thread: How to read XML using XPath in Java
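As a rough sketch of the XPath-only route, the snippet below builds the file name -> class Map with the JDK's javax.xml.xpath API. The class and method names are hypothetical, and the sample input is a trimmed-down version of the question's table with the xmlns declaration left out; the document is parsed without namespace awareness (the DocumentBuilderFactory default), and a namespace-aware parse would need a NamespaceContext for these expressions.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, for illustration only.
final class FileStatusXPath {

    // Maps each file name to the class attribute of the row that contains it.
    static Map<String, String> mapFileNamesToStatus(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every row that has a "fileName" cell.
        NodeList rows = (NodeList) xpath.evaluate(
                "//tr[td[@class='fileName']]", doc, XPathConstants.NODESET);
        Map<String, String> result = new LinkedHashMap<>(); // keeps document order
        for (int i = 0; i < rows.getLength(); i++) {
            Element row = (Element) rows.item(i);
            // Evaluate relative to the row; normalize-space trims the cell text.
            String name = xpath.evaluate("normalize-space(td[@class='fileName'])", row);
            result.put(name, row.getAttribute("class"));
        }
        return result;
    }

    // Stand-in for the jTidy step: parses an XHTML string into a DOM Document.
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input, modelled on the rows in the question.
        String xhtml =
              "<html><body><table><tbody>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File One</span></td></tr>"
            + "<tr class=\"fail\"><td class=\"fileName\"><span>File Two</span></td></tr>"
            + "<tr class=\"success\"><td class=\"fileName\"><span>File Three</span></td></tr>"
            + "</tbody></table></body></html>";
        System.out.println(mapFileNamesToStatus(parse(xhtml)));
    }
}
```

Since jTidy already hands you an org.w3c.dom.Document, you can pass it straight to mapFileNamesToStatus and skip the parse helper.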
Upvotes: 0