Reputation: 21

HTML content extraction using HtmlUnit

I have a series of HTML files that all share the same structure.

Take this example:

>     <html>
>     <head>
>     <title>main page</title>
>     </head>
>     <body>
>     <table><tr>
>     <td>content1</td>
>     </tr></table>
>     </body>
>     </html>

I want to extract the text of the title tag and of the td tag. How can I do this with HtmlUnit? I am new to HtmlUnit, so any help is appreciated.

Upvotes: 0

Answers (2)

Urs Reupke

Reputation: 6921

See this instructive snippet from the HtmlUnit page.

There you first construct a client, then retrieve your page, and finally either ask for the title text (page.getTitleText()) or get the entire page as an XML string (page.asXml()). You could then assert that this string contains the content you expect.

There are plenty of other options, such as retrieving elements by id. It is best to look through the examples yourself.
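The steps above could look roughly like this for the sample file in the question. This is only a sketch: it assumes the HtmlUnit 2.x package name `com.gargoylesoftware.htmlunit` (newer releases use `org.htmlunit` instead), and the class name `TitleAndCells` and the default file name `page.html` are made up for illustration.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of HtmlUnit.
public class TitleAndCells {

    /** Returns the page title followed by the text of every td element. */
    public static List<String> extract(File htmlFile) {
        try (WebClient webClient = new WebClient()) {
            // The files are static, so JavaScript support is not needed.
            webClient.getOptions().setJavaScriptEnabled(false);

            // Assumes the file has an .html extension so it loads as an HtmlPage.
            HtmlPage page = webClient.getPage(htmlFile.toURI().toURL());

            List<String> result = new ArrayList<>();
            result.add(page.getTitleText());
            for (DomElement td : page.getElementsByTagName("td")) {
                result.add(td.getTextContent().trim());
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "page.html" is a placeholder; pass the real path as the first argument.
        File file = new File(args.length > 0 ? args[0] : "page.html");
        for (String text : extract(file)) {
            System.out.println(text);
        }
    }
}
```

Since all the files share one structure, the same `extract` call can be run in a loop over the whole directory.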

Upvotes: 1

Mike Samuel

Reputation: 120496

HtmlUnit is a testing tool, not a DOM parser.

To parse HTML into a DOM, use the parser at http://about.validator.nu/htmlparser/ and its HtmlDocumentBuilder class.

Once you have a Document, you can call myDocument.getElementsByTagName("title") to find the title element.
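As a rough illustration of the `getElementsByTagName` step, here is a sketch that runs on the JDK alone. Note the substitution: it uses the JDK's XML `DocumentBuilder` in place of `HtmlDocumentBuilder`, which only works because the sample in the question happens to be well-formed XML. Real-world HTML generally needs the HTML parser linked above. The class name `DomExtract` and its helper methods are made up for this example.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration.
public class DomExtract {

    /** Parses well-formed markup into a W3C DOM Document. */
    public static Document parse(String markup) {
        try {
            // Stand-in for HtmlDocumentBuilder: the JDK's XML parser.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    /** Collects the trimmed text content of every element with the given tag name. */
    public static List<String> textOf(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>main page</title></head>"
                + "<body><table><tr><td>content1</td></tr></table></body></html>";
        Document doc = parse(html);
        System.out.println(textOf(doc, "title")); // [main page]
        System.out.println(textOf(doc, "td"));    // [content1]
    }
}
```

The `textOf` helper depends only on the `org.w3c.dom` interfaces, so it should work unchanged on a Document produced by any other parser.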

Upvotes: 0
