saurabh

Reputation: 257

How to parse HTML and get the result as a String in Java

I want to parse an HTML document and get the result as a string. The body of the outer HTML contains another HTML string, and I want that inner HTML as the output string.

Example input HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head></head><body><p>&lt;!DOCTYPE html&gt;<br />&lt;html&gt;<br />&lt;body&gt;<br /><br />&lt;h1&gt;My First Heading&lt;/h1&gt;<br /><br />&lt;p&gt;My first paragraph.&lt;/p&gt;<br /><br />&lt;/body&gt;<br />&lt;/html&gt;<br /><br /></p></body></html>

Output string:

<!DOCTYPE html><html><body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>

Important: I am using an HTML editor that, when I enter something and call getText, returns the HTML representation of that input. The first HTML string above is exactly that representation.

Also, the output string should be the same as the result of running the first string here (http://www.w3schools.com/html/tryit.asp?filename=tryhtml_basic).

Please help me with this.

Upvotes: 0

Views: 469

Answers (1)

Vyncent

Reputation: 1205

I would go with a regexp:

(<!DOCTYPE html>).*(<html>.*</html>).+

and take group 1 and group 2:

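    // Needs java.util.regex.Pattern and java.util.regex.Matcher.
    // Decode the escaped angle brackets so the inner document becomes real markup.
    tst = tst.replace("&lt;", "<").replace("&gt;", ">");
    // Group 1: inner doctype; group 2: inner <html>...</html> (the outer </html> follows it).
    Pattern p = Pattern.compile("(<!DOCTYPE html>).*(<html>.*</html>).*</html>.*");
    Matcher m = p.matcher(tst);
    if (m.find()) {
        System.out.println(m.group(1) + m.group(2));
    }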

Running example: http://rextester.com/JTOJ89529
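
If the `<br />` tags the editor inserts between the lines must not appear in the output, a different route may be simpler: take the contents of the outer `<body>`, drop every real tag, and only then decode the entities. The following is a minimal sketch, not a general HTML parser. It assumes the editor always escapes the inner markup as entities and only adds real tags such as `<p>` and `<br />` around it; the class name `InnerHtmlExtractor` and the method `extractInnerHtml` are made up for illustration.

```java
public class InnerHtmlExtractor {

    // Hypothetical helper: returns the text content of the outer <body>
    // with the basic HTML entities decoded.
    static String extractInnerHtml(String outerHtml) {
        // The inner body tags are escaped (&lt;body&gt;), so these searches
        // only find the outer, real <body> element.
        int start = outerHtml.indexOf("<body>");
        int end = outerHtml.lastIndexOf("</body>");
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no <body>...</body> found");
        }
        String body = outerHtml.substring(start + "<body>".length(), end);

        // Remove the real tags added by the editor (<p>, <br />, ...).
        String text = body.replaceAll("<[^>]*>", "");

        // Decode entities; &amp; goes last so "&amp;lt;" is not decoded twice.
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String input = "<html><head></head><body><p>"
                + "&lt;h1&gt;My First Heading&lt;/h1&gt;<br />"
                + "</p></body></html>";
        System.out.println(extractInnerHtml(input)); // prints <h1>My First Heading</h1>
    }
}
```

Because the tags are stripped before decoding, the decoded inner markup is never touched by the tag-removal step. For input that is less regular than this, a real HTML parser would be a safer choice than string operations.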

Upvotes: 1
