Gemma

Reputation: 37

A question on webpage representation in Java

I followed a tutorial and came up with the following method, which reads a webpage's content into a CharSequence:

    public static CharSequence getURLContent(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        String encoding = conn.getContentEncoding();
        if (encoding == null) {
            encoding = "ISO-8859-1";
        }
        BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), encoding));
        StringBuilder sb = new StringBuilder(16384);
        try {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line);
                sb.append('\n');
            }
        } finally {
            br.close();
        }
        return sb;
    }

It returns a representation of the webpage specified by the URL. However, this representation is very different from what I see when I use "View Page Source" in Firefox. Since I need to scrape data from the original webpage (a data segment that appears in the "View Page Source" output), my code always fails to find the required text in the Java representation. Did I go wrong somewhere? I'd appreciate your advice. Thanks a lot for helping!

Upvotes: 0

Views: 124

Answers (2)

objects

Reputation: 8677

Request details such as the User-Agent header and cookies can change what the server returns in the response. So the problem is more likely in the details of the request you are sending than in how you are reading the response.

A library such as HttpClient makes it easier to simulate a request sent from a browser.
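As a rough illustration of the point above, here is a minimal sketch that sets browser-like request headers using only the plain `java.net.URLConnection` from the question, not HttpClient. The class name `BrowserLikeRequest`, the method `open`, and the example User-Agent string are all hypothetical; the headers a given site actually looks at are an assumption and would need to be checked against what your browser really sends.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class BrowserLikeRequest {

    // Hypothetical example value; copy the real string from your own browser.
    static final String EXAMPLE_USER_AGENT =
            "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0";

    /**
     * Prepares a connection with browser-like headers.
     * Nothing is sent until the caller connects or reads from the connection.
     */
    static URLConnection open(URL url, String userAgent, String cookie)
            throws IOException {
        URLConnection conn = url.openConnection();
        // Some servers return different markup depending on this header.
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8");
        // Only send a cookie if the caller supplied one (e.g. a session cookie).
        if (cookie != null) {
            conn.setRequestProperty("Cookie", cookie);
        }
        return conn;
    }
}
```

In `getURLContent`, the call `url.openConnection()` could be swapped for something like `BrowserLikeRequest.open(url, BrowserLikeRequest.EXAMPLE_USER_AGENT, null)`, leaving the reading loop unchanged. If the output still differs from "View Page Source", cookies or other headers may be involved.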

Upvotes: 1

Itay Maman

Reputation: 30723

You need to use an HTML-parsing library to build a data structure representing the HTML of this webpage. I recommend this library: http://htmlparser.sourceforge.net.
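To give a feel for parsing instead of searching raw text, here is a small sketch. Note that it does not use the library recommended above; it swaps in the JDK's built-in Swing HTML parser (`javax.swing.text.html.parser.ParserDelegator`), which reports tags through callbacks rather than building a full tree. The class name `LinkExtractor` and the choice to collect `href` values are hypothetical, chosen only for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class LinkExtractor {

    /** Returns the href value of every <a> start tag, in document order. */
    static List<String> extractLinks(CharSequence html) throws IOException {
        final List<String> links = new ArrayList<String>();

        // The parser calls back for each start tag it recognizes.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        Reader reader = new StringReader(html.toString());
        try {
            // 'true' tells the parser to ignore any charset declared in the markup.
            new ParserDelegator().parse(reader, callback, true);
        } finally {
            reader.close();
        }
        return links;
    }
}
```

The CharSequence returned by `getURLContent` could be passed straight to `extractLinks`. A dedicated library is likely to cope better with malformed real-world pages than this built-in parser.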

Upvotes: 1
