Jake Wilko

Reputation: 45

Java - Extract html information from string

All of the guides out there explain how to remove the HTML tags from a string to extract the text between them. What I am after is the opposite: extracting the data that is inside the HTML tags themselves.

For example, if I have the string:

 "<FONT SIZE="5">Hello World</FONT>"

I want to read the font size (the value of the SIZE attribute) so I can update other variables. How do I go about this?

Upvotes: 1

Views: 334

Answers (4)

pap

Reputation: 27604

I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware of trying to parse HTML as "standard" XML: XML parsing is strict by nature and will fail if the page does not conform to the XML markup specs (which few HTML pages do).

Upvotes: 2

Tator

Reputation: 556

You can use a library like jerichoHTML, which lets you search for HTML tags as well as their attributes, or you can build a DOM on your own.

Upvotes: 0

Martin Green

Reputation: 1054

You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.

Upvotes: 1

romedius

Reputation: 775

Take a look at the Java API for XML Processing (JAXP): http://en.wikipedia.org/wiki/Java_API_for_XML_Processing. If you parse the HTML, you should be able to extract the attribute values from the DOM tree.
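As a rough sketch of what that could look like for the string in the question, assuming the fragment happens to be well-formed XML (the one in the question is): the class name FontSizeExtractor and the helper rootAttribute are made up for illustration. XML attribute names are case-sensitive, so "SIZE" has to match the markup exactly, and real-world HTML that is not well-formed will be rejected by this parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class FontSizeExtractor {

    // Hypothetical helper: parses a well-formed fragment with JAXP and
    // returns the named attribute of its root element.
    // Returns an empty string if the attribute is absent.
    public static String rootAttribute(String markup, String attributeName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(markup)));
            Element root = doc.getDocumentElement();
            return root.getAttribute(attributeName);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // The strict XML parser rejects anything that is not well-formed.
            throw new IllegalArgumentException("Markup is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String html = "<FONT SIZE=\"5\">Hello World</FONT>";
        int fontSize = Integer.parseInt(rootAttribute(html, "SIZE"));
        System.out.println(fontSize); // prints 5
    }
}
```

An unclosed tag such as "<FONT SIZE=\"5\">Hello" makes this throw, so for arbitrary pages a lenient HTML parser is probably the safer choice.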

Upvotes: -1
