Reputation: 45
All of the guides out there tell me how to remove the HTML tags from the text in order to extract the text between them. What I am after is the data inside the HTML tags themselves, i.e. the attribute values.
For example, if I have a string:
"<FONT SIZE="5">Hello World</FONT>"
I want to get the font size information to update other variables. How do I go about this?
Upvotes: 1
Views: 334
Reputation: 27604
I've used jsoup several times for this purpose. It's a lenient HTML parser. Beware of trying to parse HTML as "standard" XML: XML parsing is strict by nature and will fail if the page does not conform to the XML markup specs (which few HTML pages do).
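To illustrate the strictness point, here is a minimal sketch that uses only the JDK's built-in XML parser (not jsoup). The class name StrictXmlDemo and the helper isWellFormedXml are made up for this example. It shows that typical sloppy HTML (unquoted attribute value, unclosed BR tag) is rejected by an XML parser, while the well-formed string from the question is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class StrictXmlDemo {
    // Hypothetical helper: true if the markup parses as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The strict XML parser gave up on the input.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: quoted attribute, matching close tag.
        System.out.println(isWellFormedXml("<FONT SIZE=\"5\">Hello World</FONT>")); // true
        // Common in real HTML, but not XML: unquoted value, unclosed <BR>.
        System.out.println(isWellFormedXml("<FONT SIZE=5>Hello<BR>World</FONT>")); // false
    }
}
```

The JDK parser may also log a "[Fatal Error]" line to stderr for the second input. A lenient HTML parser is meant to cope with input like the second string instead of rejecting it.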
Upvotes: 2
Reputation: 556
You can use a library like jerichoHTML, which lets you search for HTML tags as well as their attributes, or you can build a DOM on your own.
Upvotes: 0
Reputation: 1054
You go about this by using one of the available Java libraries for HTML parsing, like TagSoup.
Upvotes: 1
Reputation: 775
Take a look at JAXP: http://en.wikipedia.org/wiki/Java_API_for_XML_Processing If you parse the HTML, you should be able to extract the attribute values from the DOM tree.
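A minimal sketch of that approach for the string in the question, using only the JAXP classes that ship with the JDK. The class name FontSizeExtractor and the helper rootAttribute are hypothetical names for this example. It assumes the markup is well-formed XML, which holds for the sample string but not for HTML in general, and XML attribute names are case-sensitive, so "SIZE" has to match the source exactly.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class FontSizeExtractor {
    // Hypothetical helper: parses the markup and returns the named
    // attribute of the root element ("" if the attribute is absent).
    static String rootAttribute(String markup, String attributeName) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(markup)));
        Element root = doc.getDocumentElement();
        return root.getAttribute(attributeName);
    }

    public static void main(String[] args) throws Exception {
        String markup = "<FONT SIZE=\"5\">Hello World</FONT>";
        // Convert the attribute text to a number so it can update other variables.
        int fontSize = Integer.parseInt(rootAttribute(markup, "SIZE"));
        System.out.println(fontSize); // prints 5
    }
}
```

For markup with more than one element, doc.getElementsByTagName("FONT") returns every matching element, and getAttribute can be called on each in turn.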
Upvotes: -1