anny

Reputation: 21

How to download the exact source code of the webpage

I want to download the source code of a webpage. I have tried the URL method, i.e. URL url = new URL("http://a.html");

and the Jsoup method, but neither returns exactly the data shown in the actual source code. For example:

<input type="image"
       name="ctl00$dtlAlbums$ctl00$imbAlbumImage"    
       id="ctl00_dtlAlbums_ctl00_imbAlbumImage"
       title="Independence Day Celebr..."
       border="0"         
       onmouseover="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','0','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');"
       onmouseout="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','1','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');" 
       src="Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG"     
       alt="Independence Day Celebr..." 
       style="height:79px;width:148px;border-width:0px;"
/>

In this tag the last attribute, 'style', is not detected by the Jsoup code, and when I download the page with the URL method, the style attribute is changed into a border=""/> attribute.

Can anybody tell me a way to download the exact source code of a webpage? My code is:

URL url=new URL("http://www.apcob.org/");
InputStream is = url.openStream();  // throws an IOException
BufferedReader  br = new BufferedReader(new InputStreamReader(is));
String line;
File fileDir = new File(contextpath+"\\extractedtxt.txt");
Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
while ((line = br.readLine()) != null)
{
 // System.out.println("line\n "+line);
  fw.write("\n"+line);
}
 InputStream in = new FileInputStream(new File(contextpath+"\\extractedtxt.txt"));
String baseUrl="http://www.apcob.org/";
Document doc=Jsoup.parse(in,"UTF-8",baseUrl);
System.out.println(doc);

The second method I followed is:

Document doc = Jsoup.connect(url_of_currentpage).get();

I want to do this in Java; the website where this problem occurs is 'http://www.apcob.org/'.
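For comparison, a sketch of fetching the raw, unparsed response with Jsoup's Connection.execute() (assuming the jsoup jar is on the classpath) — execute() returns the body exactly as the server sent it, before Jsoup's parser normalizes any attributes:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RawSource {
    public static void main(String[] args) throws Exception {
        // execute() fetches the page but does not parse it yet
        Connection.Response res = Jsoup.connect("http://www.apcob.org/").execute();
        String rawHtml = res.body();   // the body exactly as the server sent it
        System.out.println(rawHtml);   // compare this with the browser's "View Source"
        Document doc = res.parse();    // parse afterwards, once the raw copy is saved
    }
}
```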

Upvotes: 2

Views: 1263

Answers (5)

Jenna Sloan

Reputation: 382

When getting a webpage via HTTP, the web server usually processes the source in some way; you can't get the exact source of a PHP file over HTTP. As far as I know, the only way to accomplish what you are asking is to use FTP.

Upvotes: 0

gnsb

Reputation: 326

Here is a handy function to fetch a webpage. Get the HTML string with it, then parse the string into a Document using Jsoup.

public static String fetchPage(String urlFullAddress) throws IOException {
//      String proxy = "10.3.100.207";
//      int port = 8080;
        URL url = new URL(urlFullAddress);
//      Proxy proxyConnect = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxy, port));
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();//proxyConnect);

        connection.addRequestProperty("User-Agent",
                "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10");
        connection.setReadTimeout(5000); // set timeout

        connection.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        connection.addRequestProperty("Accept-Language", "en-US,en;q=0.5");
        connection.addRequestProperty("Accept-Encoding", "gzip, deflate");
        connection.addRequestProperty("Connection", "keep-alive");
        System.setProperty("http.keepAlive", "true");

        // we asked for gzip, so decompress if the server used it
        // (needs java.util.zip.GZIPInputStream)
        InputStream stream = connection.getInputStream();
        if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
            stream = new GZIPInputStream(stream);
        }
        BufferedReader in = new BufferedReader(new InputStreamReader(stream));

        StringBuilder page = new StringBuilder();
        String current;
        while ((current = in.readLine()) != null) {
            page.append(current).append('\n');
        }
        in.close();

        return page.toString();
}

If the problem turns out to be with the Jsoup parser, try Jericho (http://jericho.htmlparser.net/docs/index.html). It parses HTML as-is, without correcting errors.
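A minimal Jericho sketch (assumes the jericho-html jar is on the classpath) that prints the style attributes exactly as they appear in the source, since Jericho keeps the markup verbatim:

```java
import java.net.URL;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

public class JerichoAttributes {
    public static void main(String[] args) throws Exception {
        // Source keeps the original markup untouched
        Source source = new Source(new URL("http://www.apcob.org/"));
        // print the style attribute of every <input> element, or null if absent
        for (Element input : source.getAllElements("input")) {
            System.out.println(input.getAttributeValue("style"));
        }
    }
}
```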

A few other things I noticed: you did not close fw. Replace UTF8 with UTF-8. If you need to parse a lot of CSS, try a CSS parser.

Upvotes: 0

Pallav Jha

Reputation: 3619

I guess this would work fine:

public static void main(String[] args) throws Exception {
    //Only If you're using a proxy
    //System.setProperty("java.net.useSystemProxies", "true");

    URL url = new URL("http://www.apcob.org/");

    HttpURLConnection yc = (HttpURLConnection) url.openConnection();
    yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36");
    BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));

    String inputLine;
    while ((inputLine = in.readLine()) != null)
        System.out.println(inputLine);
    in.close();
}

Upvotes: 1

Stephan

Reputation: 43053

The page you're trying to download is modified somehow by JavaScript code. Jsoup is an HTML parser; it doesn't run JavaScript.

If you want to get the source code as you can see it in Chrome, use one of the following tools:

All three can parse and run the JavaScript code inside the page.

Upvotes: 2

TDG

Reputation: 6171

It's probably due to a different user agent string - when you browse the page in your browser, it sends a user agent string identifying the browser's type. Some sites respond with different pages for different browsers (e.g. mobile devices).
Try adding the same user agent string your browser sends.
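With Jsoup this can be done via userAgent(); a sketch, assuming the jsoup jar is on the classpath (the UA string shown is only an example - copy the exact string your own browser sends, visible in its developer tools):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BrowserAgentFetch {
    public static void main(String[] args) throws Exception {
        // example desktop-browser user agent; substitute your browser's own string
        Document doc = Jsoup.connect("http://www.apcob.org/")
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36")
                .get();
        System.out.println(doc.outerHtml());
    }
}
```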

Upvotes: 2
