Reputation: 21
I want to download the source code of a web page. I have tried the URL approach, i.e. URL url = new URL("http://a.html");
and Jsoup, but neither gives me exactly the markup that appears in the actual page source. For example:
<input type="image"
name="ctl00$dtlAlbums$ctl00$imbAlbumImage"
id="ctl00_dtlAlbums_ctl00_imbAlbumImage"
title="Independence Day Celebr..."
border="0"
onmouseover="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','0','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');"
onmouseout="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','1','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');"
src="Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG"
alt="Independence Day Celebr..."
style="height:79px;width:148px;border-width:0px;"
/>
In this tag, the last attribute, style, is not detected by Jsoup. If I download the page with the URL approach instead, the style attribute is replaced by border=""/>.
Can anybody tell me how to download the exact source code of a web page? My code is:
URL url=new URL("http://www.apcob.org/");
InputStream is = url.openStream(); // throws an IOException
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
File fileDir = new File(contextpath+"\\extractedtxt.txt");
Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
while ((line = br.readLine()) != null)
{
// System.out.println("line\n "+line);
fw.write("\n"+line);
}
InputStream in = new FileInputStream(new File(contextpath+"extractedtxt.txt"));
String baseUrl="http://www.apcob.org/";
Document doc=Jsoup.parse(in,"UTF-8",baseUrl);
System.out.println(doc);
The second method I tried is:
Document doc = Jsoup.connect(url_of_currentpage).get();
I want to do this in Java. The website on which this problem occurs is http://www.apcob.org/.
Upvotes: 2
Views: 1263
Reputation: 382
When you request a web page via HTTP, the web server usually processes the source before sending it; you can't get the exact source of a PHP file over HTTP, only the output it generates.
As far as I know, the only way to accomplish what you are asking is by using FTP.
Upvotes: 0
Reputation: 326
Here is a handy function to fetch a web page. Use it to get the HTML as a String, then parse the String into a Document using Jsoup.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public static String fetchPage(String urlFullAddress) throws IOException {
    // To go through a proxy, uncomment these lines and pass proxyConnect to openConnection():
    // String proxy = "10.3.100.207";
    // int port = 8080;
    // Proxy proxyConnect = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxy, port));
    URL url = new URL(urlFullAddress);
    HttpURLConnection connection = (HttpURLConnection) url.openConnection(); // or openConnection(proxyConnect)
    connection.setRequestProperty("User-Agent",
            "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10");
    connection.setReadTimeout(5000); // give up if the server stalls for 5 seconds
    connection.setRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    connection.setRequestProperty("Accept-Language", "en-US,en;q=0.5");
    // No Accept-Encoding header: the JDK's HttpURLConnection does not decompress
    // gzip/deflate bodies itself, so requesting them would yield compressed bytes.
    StringBuilder page = new StringBuilder();
    // Assumes the page is encoded as UTF-8; try-with-resources closes the reader.
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
        String current;
        while ((current = in.readLine()) != null) {
            page.append(current).append('\n'); // readLine() strips the line terminator
        }
    }
    return page.toString();
}
If the problem turns out to be with the Jsoup parser, try http://jericho.htmlparser.net/docs/index.html. It parses HTML as-is, without correcting errors.
A few other things I noticed:
You did not close fw.
Replace UTF8 with UTF-8.
If you need to parse a lot of CSS, try a CSS parser.
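The point about closing fw may matter more than it looks: a BufferedWriter that is never closed or flushed can leave the tail of the page in its buffer, so the file that is read back may be incomplete. Here is a minimal sketch of writing the lines with try-with-resources; the class name SaveLines, the method save, and the temporary file are made up for illustration and are not part of the question's code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class SaveLines {
    // Writes each line followed by '\n'. try-with-resources closes the writer,
    // which also flushes any buffered text to disk.
    static void save(List<String> lines, Path target) throws IOException {
        try (BufferedWriter fw = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                fw.write(line);
                fw.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical target file; the question writes to contextpath + "\\extractedtxt.txt".
        Path target = Files.createTempFile("extractedtxt", ".txt");
        save(List.of("<html>", "</html>"), target);
        System.out.print(Files.readString(target, StandardCharsets.UTF_8));
    }
}
```

The file is complete as soon as save returns, so it is safe to open it for parsing afterwards.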
Upvotes: 0
Reputation: 3619
I think this should work:
public static void main(String[] args) throws Exception {
//Only If you're using a proxy
//System.setProperty("java.net.useSystemProxies", "true");
URL url = new URL("http://www.apcob.org/");
HttpURLConnection yc = (HttpURLConnection) url.openConnection();
yc.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36");
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
String inputLine;
while ((inputLine = in.readLine()) != null)
System.out.println(inputLine);
in.close();
}
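If the goal is a byte-for-byte copy of what the server sends, one variation is to skip the line-by-line reader entirely, since readLine() discards line terminators and decoding to text can alter bytes. This is only a sketch: the class ExactCopy and the method saveExact are invented names, and the demo feeds in an in-memory stream so that it runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

final class ExactCopy {
    // Copies the stream to the target file without decoding or re-encoding it,
    // so line endings and every other byte are preserved.
    static void saveExact(InputStream in, Path target) throws IOException {
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a server response; note the mixed \r\n and \n line endings.
        byte[] page = "<input style=\"height:79px;\"\r\n/>\n".getBytes(StandardCharsets.UTF_8);
        Path target = Files.createTempFile("page", ".html");
        saveExact(new ByteArrayInputStream(page), target);
        System.out.println(Arrays.equals(page, Files.readAllBytes(target))); // prints true
    }
}
```

With a real request, the stream returned by yc.getInputStream() could be passed to saveExact in place of the in-memory stream.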
Upvotes: 1
Reputation: 43053
The page you're trying to download is modified by JavaScript code after it loads. Jsoup is an HTML parser; it doesn't run JavaScript.
If you want to get the source code as you see it in Chrome, use one of the following tools:
All three can parse and run the JavaScript code inside the page.
Upvotes: 2
Reputation: 6171
It's probably due to a different user agent string. When you open the page in your browser, it sends a user agent string that identifies the browser's type. Some sites respond with different pages to different browsers (e.g. mobile devices).
Try sending the same user agent string as your browser's.
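To see the effect in isolation, here is a small self-contained sketch. It starts a throwaway local server that picks its response based on the User-Agent header, then fetches it twice with different strings. The server, its two page bodies, and the names UserAgentDemo, startServer and fetch are all invented for the demonstration; how the real site behaves is not known from this.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class UserAgentDemo {
    // Starts a local server on a free port that serves a different body
    // when the User-Agent header contains "Mobile".
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            String body = (ua != null && ua.contains("Mobile"))
                    ? "<p>mobile page</p>" : "<p>desktop page</p>";
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    // Fetches the address with the given User-Agent and returns the body as text.
    static String fetch(String address, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            String base = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            System.out.println(fetch(base, "Mozilla/5.0 (iPad) Mobile Safari"));
            System.out.println(fetch(base, "Mozilla/5.0 (Windows NT 6.1) Chrome"));
        } finally {
            server.stop(0);
        }
    }
}
```

The same URL yields two different documents, which is why copying the exact user agent string from the browser is worth trying before suspecting the parser.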
Upvotes: 2