Extracting contents from HTML represented as a String

Question

I have a large HTML document in a String variable and I want to get the contents of one div. I cannot rely on a regular expression because the div can contain nested divs. Suppose I have the following String:

String test = "<div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div>";

How can I get this with a simple Java program?

<div id="mainContent">foo bar<div>good best better</div> <div>test test</div></div>

My approach is something like this (it might be horrible, I am still fighting to correct it):

public static void main(String[] args) {
    int count = 1;
    int fl = 0;
    String s = "<div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div>";
    String tmp = s;
    int len = s.length();
    for (int i = 0; i < len; i++) {
        int st = s.indexOf("div>");
        if (st > -1) {
            char c = s.charAt(st - 1);
            if (c == '/') {
                count--;
            } else {
                count++;
            }
            s = s.substring(st + 4);
            System.out.println(s);
            i = i + st;
            System.out.println(c + " -- " + st + " -- " + count + " -- " + i);
            if (count == 0) {
                fl = i;
                break;
            }
        }
    }
    System.out.println("final ind - " + fl);
    s = tmp.substring(0, fl + 4);
    System.out.println("final String - " + s);
}

user177800 · Accepted Answer

I would recommend using jsoup to parse the HTML and find what you are looking for.

It fulfills the simple requirement for sure. You can do what you want in just a couple of lines of code!

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

With it you can scrape and parse HTML from a URL, file, or string, and find and extract data using DOM traversal or CSS selectors.

jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.

Using the selector syntax makes finding and extracting data extremely simple.

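import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Main
{
    public static void main(final String[] args)
    {
        final String s = "<div id=\"mainContent\">foo bar<div>good best better</div> <div>test test</div></div><div>foo bar</div>";
        // Parse the string into a DOM, then select the div by its id.
        final Document d = Jsoup.parse(s);
        final Elements e = d.select("#mainContent");
        // Printing the element writes its outer HTML.
        System.out.println(e.get(0));
    }
}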

outputs

  <div id="mainContent">
   foo bar
   <div>
    good best better
   </div>
   <div>
    test test
   </div>
  </div>

Doesn't get much more simple than that!
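
If adding a library is not an option, the depth-counting idea from the question can be made to work for this narrow case. The following is only a minimal sketch using the JDK alone, not a replacement for a real parser: the class name DivExtractor and the method extractDiv are hypothetical, and it assumes lowercase, well-formed tags, an opening tag written exactly as <div id="...">, and no "<div" text inside comments, scripts, attribute values, or other tag names.

```java
final class DivExtractor {

    /**
     * Returns the outer HTML of the first div whose opening tag starts with
     * <div id="ID", or null if no such div exists or it is never closed.
     * Hypothetical helper; assumes lowercase, well-formed markup.
     */
    static String extractDiv(String html, String id) {
        int start = html.indexOf("<div id=\"" + id + "\"");
        if (start < 0) {
            return null; // no div with that id
        }
        int depth = 0;
        int pos = start;
        while (pos < html.length()) {
            int open = html.indexOf("<div", pos);
            int close = html.indexOf("</div>", pos);
            if (close < 0) {
                return null; // unbalanced markup: a div is never closed
            }
            if (open >= 0 && open < close) {
                // The next tag is an opening div: go one level deeper.
                depth++;
                pos = open + "<div".length();
            } else {
                // The next tag is a closing div: come back up one level.
                depth--;
                pos = close + "</div>".length();
                if (depth == 0) {
                    // The div that started at 'start' is now closed.
                    return html.substring(start, pos);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<div id=\"mainContent\">foo bar<div>good best better</div> "
                + "<div>test test</div></div><div>foo bar</div>";
        System.out.println(extractDiv(html, "mainContent"));
    }
}
```

Unlike the attempt in the question, this keeps a single absolute position into the original string instead of repeatedly trimming it, so the final substring indexes stay consistent. Anything beyond this simple shape (uppercase tags, comments, malformed markup) is where a parser such as jsoup earns its keep.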
