Reputation: 97
I'm trying to download all the PDF files from a website, and my code is bad. I guess there is a better way out there. Anyway, here it is:
try {
    System.out.println("Download started");
    URL getURL = new URL("http://cs.lth.se/eda095/foerelaesningar/?no_cache=1");
    URL pdf;
    URLConnection urlC = getURL.openConnection();
    InputStream is = urlC.getInputStream();
    BufferedReader buffRead = new BufferedReader(new InputStreamReader(is));
    FileOutputStream fos = null;
    byte[] b = new byte[1024];
    String line;
    double i = 1;
    int t = 1;
    int length;
    while ((line = buffRead.readLine()) != null) {
        while ((length = is.read(b)) > -1) {
            if (line.contains(".pdf")) {
                pdf = new URL("http://fileadmin.cs.lth.se/cs/Education/EDA095/2015/lectures/"
                        + "f" + i + "-" + t + "x" + t);
                fos = new FileOutputStream(new File("fil" + i + "-" + t + "x" + t + ".pdf"));
                fos.write(b, 0, line.length());
                i += 0.5;
                t += 1;
                if (t > 2) {
                    t = 1;
                }
            }
        }
    }
    is.close();
    System.out.println("Download finished");
} catch (MalformedURLException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
The files I get are damaged, but is there a better way to download the PDF files? On this site the files happen to be named f1-1x1, f1-2x2, f2-1x1, and so on, but what if they were named donalds.pdf, stack.pdf, etc.?
So the question is: how do I make my code better at downloading all the PDF files?
Upvotes: 0
Views: 3005
Reputation: 140407
Basically you are asking: "How can I parse HTML reliably, to identify all download links that point to PDF files?"
Anything else (like what you have right now, trying to anticipate how the links would/could/should look) will be a constant source of grief, because any update to the web site, or running your code against a different web site, is very likely to break it. That is because HTML is complex and comes in so many flavors that you should simply forget about "easy" solutions for analysing HTML content.
In that sense: learn how to use an HTML parser; a first starting point could be Which HTML Parser is the best?
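To give you an idea of what that could look like, here is a minimal sketch using jsoup (just one possible parser, not the only option; the selector, class name, and error handling are simplified for illustration, and the page URL is the one from your question):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class PdfDownloader {
        public static void main(String[] args) throws Exception {
            String page = "http://cs.lth.se/eda095/foerelaesningar/?no_cache=1";
            // Fetch and parse the page, then select every <a> whose href ends with ".pdf"
            Document doc = Jsoup.connect(page).get();
            for (Element link : doc.select("a[href$=.pdf]")) {
                String fileUrl = link.attr("abs:href"); // absolute URL of the PDF
                String fileName = fileUrl.substring(fileUrl.lastIndexOf('/') + 1);
                // Copy the raw bytes straight to disk; no Reader involved
                try (InputStream in = new URL(fileUrl).openStream()) {
                    Files.copy(in, Paths.get(fileName), StandardCopyOption.REPLACE_EXISTING);
                }
                System.out.println("Saved " + fileName);
            }
        }
    }

Because the PDF bytes are copied straight from the stream to a file, instead of going through a Reader and an unrelated byte buffer, the downloads should not end up corrupted, and the file names come from the actual links rather than from guessing a naming pattern.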
Upvotes: 2