user3142434

Reputation: 305

Parsing Webpage

I would like to parse a website and get some information from it. The problem is, when I load the page in Java and save it to a file, it doesn't contain the information I need. When I click "view source" on the page, the information isn't there either. However, when I download the page (Save As) and open it in Notepad, I can find what I need.

In short: the webpage that Java loads differs from the one I download and open in Notepad.

How do I load the page into a String so that it matches the one I save from my browser?

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public static void main(String[] args) {

    String webPage = "http://www.integral-calculator.com/#";
    try {
        URL url = new URL(webPage);
        URLConnection urlConnection = url.openConnection();
        // Read with an explicit charset; the platform default may not match the page.
        try (Reader isr = new InputStreamReader(
                urlConnection.getInputStream(), StandardCharsets.UTF_8)) {

            int numCharsRead;
            char[] charArray = new char[1024];
            StringBuilder sb = new StringBuilder();
            while ((numCharsRead = isr.read(charArray)) > 0) {
                sb.append(charArray, 0, numCharsRead);
            }
            String result = sb.toString();

            try (PrintWriter out = new PrintWriter("C:\\Users\\Patryk\\Desktop\\filename.txt")) {
                out.println(result);
            }
        }
    } catch (IOException e) {
        // MalformedURLException is a subclass of IOException, so one catch suffices.
        e.printStackTrace();
    }
}

Upvotes: 2

Views: 151

Answers (1)

peter_the_oak

Reputation: 3710

Once a browser has loaded the start page, e.g. index.html, it attempts to load and parse further content: CSS files, JavaScript files, multimedia files and more. Then, as events fire, the JavaScript runs and may load much more content.

So it may well be that the majority of a webpage's content is loaded in these later steps. If you download only the start page with a URLConnection, as your code snippet does, you receive only the initial HTML without that additional content.

If you think about this, you realize that a single, simple URLConnection is far from the full behaviour of a browser. Between the URLConnection and the browser lies the HTTP client layer. For each of these levels you'll find Java libraries with more or less complex behaviour, retrieving correspondingly more or less of the content.
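To illustrate that middle layer: since Java 11, the JDK itself ships an HTTP client (java.net.http.HttpClient, used here as a stand-in sketch rather than the Apache client mentioned below). It handles redirects and headers for you, but it still only returns the initial HTML and runs no JavaScript:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchPage {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://www.integral-calculator.com/"))
                .build();
        // The body arrives as a String, but it is still only the
        // initial HTML -- no JavaScript has been executed.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body().length() + " chars of initial HTML");
    }
}
```

Comparing this output with the browser's "Save As" result makes the gap visible: everything injected by scripts is missing.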

The following thread mentions the Apache Java HttpClient:

Equivallent of .NET's WebClient and HttpWebRequest in Java?

And in this thread, HtmlUnit is mentioned. It is capable of loading websites almost completely and of executing much of their JavaScript:

Apache HttpClient 4 And JavaScript

If you use HtmlUnit, you can download most of your webpage, including the additionally loaded content. Then you won't see much difference between the page you grab and the one the browser shows.
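A minimal sketch of that approach (assuming the HtmlUnit library is on your classpath; class names as in HtmlUnit 2.x — adjust to your version):

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class GrabPage {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            // Many real-world pages have script errors; don't abort on them.
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("http://www.integral-calculator.com/");
            // Give background JavaScript up to 5 seconds to load further content.
            webClient.waitForBackgroundJavaScript(5_000);
            // asXml() serializes the DOM *after* the scripts have run,
            // which is much closer to what you see in the browser.
            System.out.println(page.asXml());
        }
    }
}
```

The key difference to the URLConnection version is that you print the DOM after script execution, not the raw bytes from the server.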

--

One other approach to grabbing webpages is to invoke the wget command from a shell. wget can recursively download websites, including additional content and the file structure, and store them to disk.

Simply open a shell and try wget -E -H -k -K -p http://www.garfield.com. This downloads the linked philosophical cat's content in full.

Upvotes: 3

Related Questions