Reputation: 5302
Further to my earlier question, "Extending a basic web crawler to filter status codes and HTML", I'm trying to extract the contents of HTML tags, in this case "title", with the following method:
public static void parsePage() throws IOException, BadLocationException
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection()
            .getInputStream());
    kit.read(HTMLReader, doc, 0);

    // Create an iterator for all HTML tags.
    ElementIterator it = new ElementIterator(doc);
    Element elem;
    while ((elem = it.next()) != null)
    {
        if (elem.getName().equals("title"))
        {
            System.out.println("found title tag");
        }
    }
}
This works as far as telling me that it has found the tag. What I'm struggling with is how to extract the text contained within it.
I found this question on the site, "Help with Java Swing HTML parsing", but it states that the approach only works with well-formed HTML. I was hoping there is another way.
Any pointers appreciated.
Upvotes: 1
Views: 2745
Reputation: 5302
Turns out changing the method to this produces the desired result:
public static void parsePage() throws IOException, BadLocationException
{
    HTMLEditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
    Reader HTMLReader = new InputStreamReader(testURL.openConnection().getInputStream());
    kit.read(HTMLReader, doc, 0);

    // The parser stores the <title> text as a document property.
    String title = (String) doc.getProperty(Document.TitleProperty);
    System.out.println(title);
}
I think I was off on a wild goose chase with iterator/element stuff.
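For anyone who wants to try this without a live URL, here is a minimal self-contained sketch of the same idea that parses an HTML string instead. The class name TitleExtractor, the method extractTitle and the sample markup are made up for illustration and are not part of my crawler; the checked exceptions are wrapped only to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class, named here only for the sake of the example.
class TitleExtractor {

    // Parses the given HTML and returns the <title> text, or null if none was found.
    static String extractTitle(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Same property as in the method above: ignore any charset <meta> directive.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        try (Reader reader = new StringReader(html)) {
            kit.read(reader, doc, 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }

        // The title is stored as a document property rather than as element content.
        Object title = doc.getProperty(Document.TitleProperty);
        return title == null ? null : title.toString();
    }

    public static void main(String[] args) {
        // Sample markup, assumed for illustration.
        String html = "<html><head><title>Hello</title></head>"
                + "<body><p>Some text</p></body></html>";
        System.out.println(extractTitle(html));
    }
}
```

Swapping the StringReader for the InputStreamReader from the method above should give the same behaviour against a real page, though I have only sketched it against an in-memory string here.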
Upvotes: 1
Reputation: 2425
Try using Jodd:
Jerry jerry = jerry().enableHtmlMode().parse(html);
...
Or HtmlParser:
Parser parser = new Parser(htmlInput);
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("title");
NodeList nodes = parser.parse(cssFilter);
Upvotes: 3