a name

Reputation: 19

Certain PDF files are not downloading correctly

I have very little experience with Java (this is my first real program) and have been looking for a solution for hours. I have hacked together a small program to download PDF files from a link. It works fine for most links, but some of them just don't work.

For all the links that work, the content type shows up as application/pdf, but for some links it shows up as text/html for some reason.

I keep trying to rewrite the code using whatever I can find online but I keep getting the same result.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class Main {

    public static void main(String[] args) throws Exception {

        String link = "https://www.menards.com/main/items/media/UNITE051/SDS/SpectracideVegetationKillerReadyToUse2-228-714-8845-SDS-Feb16.pdf";
        String fileName = "File Name.pdf";

        URL url1 = new URL(link);
        URLConnection urlConn = url1.openConnection();

        System.out.println(urlConn.getContentType()); // This shows as text/html but it should be PDF

        byte[] buffer = new byte[1024];
        int read;

        // try-with-resources flushes and closes both streams, even if copying fails
        try (BufferedInputStream is1 = new BufferedInputStream(urlConn.getInputStream());
             BufferedOutputStream bout = new BufferedOutputStream(new FileOutputStream(fileName), 1024)) {

            // Copy the response body to the file, one buffer at a time
            while ((read = is1.read(buffer, 0, buffer.length)) >= 0) {
                bout.write(buffer, 0, read);
            }

        } catch (IOException e) {
            // Report the failure instead of silently swallowing it
            e.printStackTrace();
        }
    }
}

I need to be able to download the PDF from the link in the code.

This is what actually gets saved in the downloaded file (opened as text):

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=5074a744e2e3d891814e9a2dace20bd4,719d34d31c8e3a6e6fffd425f7e032f3">
</script>
<body>
</body></html>

Upvotes: 1

Views: 502

Answers (2)

a name

Reputation: 19

The website implemented a check to make sure I was using a browser. I copied the User-Agent string from Chrome and sent it with the request, which allowed me to download the PDF.
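For anyone who wants to try the same thing, here is a minimal sketch of sending a browser-like User-Agent header with `URLConnection.setRequestProperty`. The class name, method names, and the User-Agent value are illustrative assumptions, not the exact code or string used here; copy the real string from your own browser. Whether this is enough depends on the site's bot protection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class BrowserLikeDownload {

    // Assumed example value: replace with the User-Agent string copied from your browser.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Prepares a connection that will send the given User-Agent header.
    // openConnection() does not contact the server yet; that happens on first read.
    static URLConnection openWithUserAgent(String link, String userAgent) throws IOException {
        URLConnection conn = URI.create(link).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }

    // Downloads the resource at 'link' into 'target', overwriting any existing file.
    static void download(String link, Path target) throws IOException {
        URLConnection conn = openWithUserAgent(link, USER_AGENT);
        // Should print application/pdf if the server accepts the request as a browser request
        System.out.println(conn.getContentType());
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: BrowserLikeDownload <url> <output-file>");
            return;
        }
        download(args[0], Paths.get(args[1]));
    }
}
```

The header has to be set before the first call that touches the response (`getContentType()`, `getInputStream()`), because that is when the request is actually sent.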

Upvotes: 1

vavasthi

Reputation: 952

The URL that you are fetching doesn't point to a PDF file. It points to an HTML file that embeds the PDF file. You probably need to look closely at what the actual URL of the PDF file is. Your code seems alright.

Just run curl on the URL and see. It will most probably return an HTML file.
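If you would rather check from Java than from the command line, a rough sketch is to look at the first bytes of whatever was downloaded: a well-formed PDF begins with the ASCII bytes `%PDF-`, while an HTML page will not. The class and method names below are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class PdfCheck {

    // Header that a well-formed PDF file starts with
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if 'data' starts with the PDF header bytes.
    static boolean looksLikePdf(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, PDF_MAGIC.length), PDF_MAGIC);
    }

    public static void main(String[] args) {
        byte[] html = "<html>\n<head>".getBytes(StandardCharsets.US_ASCII);
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(html)); // false
        System.out.println(looksLikePdf(pdf));  // true
    }
}
```

Running this check on the downloaded bytes before saving them makes it obvious when the server returned a page instead of the document.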

Upvotes: 0
