Saurabh

Reputation: 2472

Extracting contents from HTML represented as a String

I have a large HTML document in a String variable and I want to get the contents of one div. I cannot rely on a regular expression because the div can contain nested divs. So, let's suppose I have the following String:

String test = "<div><div id=\"mainContent\">foo bar<div>good best better</div>  <div>test test</div></div><div>foo bar</div></div>";

Then how can I get the following with a simple Java program?

<div id="mainContent">foo bar<div>good best better</div>  <div>test test</div></div>

Well, my approach is something like this (it might be horrible, and I'm still fighting to correct it):

public static void main(String[] args) {
        int count = 1;
        int fl = 0;
        String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div>  <div>test test</div></div><div>foo bar</div></div>";
        String tmp = s;
        int len = s.length();
        for (int i=0; i<len; i++){
            int st = s.indexOf("div>");
            if(st > -1) {
                char c = s.charAt(st-1);
                if(c == '/') {
                    count--; 
                } else {
                    count++;
                }
                s = s.substring(st+4);
                System.out.println(s);
                i = i + st;
                System.out.println(c + " -- " + st + " -- " + count + " -- " + i);  
                if (count == 0) {
                    fl = i;
                    break;
                }
            }
        }
        System.out.println("final ind - " + fl);
        s = tmp.substring(0, fl + 4);
        System.out.println("final String - " + s);
}

Upvotes: 0

Views: 206

Answers (2)

user177800

Reputation:

I would recommend using jsoup to parse the HTML and find what you are looking for.

It fulfills the simple requirement for sure. You can do what you want in just a couple of lines of code!

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

It can scrape and parse HTML from a URL, file, or string, and find and extract data using DOM traversal or CSS selectors.

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.

Using the selector syntax makes finding and extracting data extremely simple.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ExtractDiv
{
    public static void main(final String[] args)
    {
        final String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div>  <div>test test</div></div><div>foo bar</div></div>";
        final Document d = Jsoup.parse(s);
        final Elements e = d.select("#mainContent"); // CSS id selector
        System.out.println(e.get(0));
    }
}

This outputs:

  <div id="mainContent">
   foo bar
   <div>
    good best better
   </div> 
   <div>
    test test
   </div>
  </div>

It doesn't get much simpler than that!

Upvotes: 2

nfechner

Reputation: 17535

I'm afraid the answer is: You don't. At least not with a "simple" program...

But there is hope: You can use an HTML parser library (like NekoHTML or HTMLParser, although the latter project seems to be dead) to parse the string and retrieve the part you need.
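
For comparison, here is a rough sketch of what a hand-rolled version might look like using only the JDK. It is not a parser: it assumes well-formed markup, an id attribute written with double quotes, and no div-like text inside comments, scripts, or attribute values. The class name DivExtractor and the method extractDivById are made up for this illustration. It finds the opening tag with the requested id, then counts opening and closing div tags until the nesting depth returns to zero.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; real-world HTML needs a real parser.
final class DivExtractor {

    // Matches an opening <div ...> or a closing </div>; group 1 is "/" for a closing tag.
    private static final Pattern DIV_TAG =
            Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the div with the given id, including its tags, or null if it is
    // missing or never closed. Assumes the id attribute is double-quoted.
    static String extractDivById(String html, String id) {
        Matcher open = Pattern.compile(
                "<div\\b[^>]*\\bid=\"" + Pattern.quote(id) + "\"[^>]*>",
                Pattern.CASE_INSENSITIVE).matcher(html);
        if (!open.find()) {
            return null;
        }
        int start = open.start();

        // Scan div tags from the target's opening tag onward, tracking nesting depth.
        Matcher tag = DIV_TAG.matcher(html);
        tag.region(start, html.length());
        int depth = 0;
        while (tag.find()) {
            if (tag.group(1).isEmpty()) {
                depth++;   // opening tag
            } else {
                depth--;   // closing tag
            }
            if (depth == 0) {
                return html.substring(start, tag.end());
            }
        }
        return null; // unbalanced markup
    }

    public static void main(String[] args) {
        String s = "<div><div id=\"mainContent\">foo bar<div>good best better</div>"
                + "  <div>test test</div></div><div>foo bar</div></div>";
        System.out.println(extractDivById(s, "mainContent"));
    }
}
```

This works for the sample string in the question, but it breaks as soon as the input contains unclosed tags, single-quoted attributes, or a literal "<div>" inside a script or comment, which is why a parser library is the safer choice.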

Upvotes: 0
