user1219266

Extract text from an HTML segment using standard Java

I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text. For example: hello world ----> hello world

Is there a way to extract the text using the Java standard library? Maybe something more efficient than replacing open/close tags with an empty string via regex? Thanks.

Upvotes: 0

Views: 1307

Answers (4)

Konrad Reiche

Reputation: 29543

Don't use regular expressions to parse HTML; use a parser, for instance jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.

Example: fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

There is also an HTML parser in the JDK: javax.swing.text.html.parser.Parser, which can be applied like this:

// harvester is your own HTMLEditorKit.ParserCallback subclass (see below)
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
parserDelegator.parse(in, harvester, true);

Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you override the appropriate callback method:

@Override
public void handleStartTag(HTML.Tag tag,
        MutableAttributeSet mutableAttributeSet, int pos) {

    // called for every start tag; only react to <a> and <area> tags
    if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {

        // read the href attribute of the tag
        String address = (String) mutableAttributeSet
                .getAttribute(HTML.Attribute.HREF);

        /* ... */
    }
}
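For the text extraction asked about in the question, the same callback mechanism can be used by overriding handleText instead of handleStartTag. The following is only a minimal sketch: the class name HtmlTextExtractor and the method extractText are made up for illustration, and the exact whitespace the Swing parser emits may differ from what you expect, so treat the result as something to normalize rather than as exact output.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the JDK: collects the text nodes
// reported by the Swing HTML parser into a single String.
class HtmlTextExtractor {

    static String extractText(String html) {
        StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // called once per run of character data between tags
                text.append(data);
            }
        };

        try {
            // third argument: ignore any charset declaration inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // a StringReader should not fail, but parse() declares IOException
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>hello <b>world</b></p>"));
    }
}
```

Note that this parser is lenient and fairly old (it targets an HTML 3.2 DTD), so for modern or messy markup a library such as jsoup is likely the more robust choice.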

Upvotes: 2

amicngh

Reputation: 7899

You can use HTMLParser, which is open source.

Upvotes: 1

hsz

Reputation: 152294

I will also say it - don't use regex with HTML. ;-)

You can give JTidy a shot.

Upvotes: 2

Denys Séguret

Reputation: 382454

Don't use regex to parse HTML; use a dedicated parser like HtmlCleaner.

A regex will usually work on the first test, and then it grows more and more complex until it becomes impossible to adapt.
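To illustrate the point, here is a small sketch of the naive tag-stripping regex the question mentions. The class name RegexStripDemo and the sample inputs are made up for this example. It handles the simple case, but breaks as soon as an attribute value contains a ">" character.

```java
// Hypothetical demo class: shows where a naive tag-stripping regex goes wrong.
class RegexStripDemo {

    // Removes anything that looks like <...>, stopping at the first '>'
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // simple markup: works as hoped
        System.out.println(stripTags("<b>hello</b> world"));
        // prints: hello world

        // '>' inside an attribute value ends the match too early
        System.out.println(stripTags("<a title=\"a > b\">link</a>"));
        // prints:  b">link
    }
}
```

Each such case can be patched with a more elaborate pattern, but comments, scripts, and entities add further cases, which is why a real parser is the safer route.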

Upvotes: 2
