Sergio del Amo

Reputation: 78096

Downloaded PDF with Java is corrupt?

I have read the excellent discussion "How to download and save a file from internet using Java". However, when I execute the following code, I get a corrupt PDF. Any idea why?

import java.io.*;
import java.net.*;

public class PDFDownload {
    public static String URL = "http://www.nbc.com/Heroes/novels/downloads/";
    public static String FOLDER = "C:/Users/sdelamo/workspace/SandBox/HeroesNovel/";

    public static void main(String[] args) {
        String filename = "Heroes_novel_001.pdf";
        try {
            saveUrl(FOLDER + filename, URL + filename);
        } catch (MalformedURLException e) {
            System.out.println("MalformedURLException");
        } catch (IOException e) {
            System.out.println("IOException");                              
        }                       
    }       



    public static void saveUrl(String filename, String urlString) throws MalformedURLException, IOException {
        BufferedInputStream in = null;
        FileOutputStream fout = null;
        try {
            URL url = new URL(urlString);
            in = new BufferedInputStream(url.openStream());
            fout = new FileOutputStream(filename);

            byte data[] = new byte[1024];
            int count;
            while ((count = in.read(data, 0, 1024)) != -1) {
                fout.write(data, 0, count);
            }
        } finally {
            if (in != null)
                in.close();
            if (fout != null)
                fout.close();
        }
    }
}

The above code downloads HTML instead of a PDF. This is the output:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">

<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css"  href="/style/default.css?sid=8a9212f822e1c675330ec418bc531169" />
<link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=8a9212f822e1c675330ec418bc531169" /> 

</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/5/H.8--WAP/4aa0e4cb8b448?vid=8a9212f822e1c675330ec418bc531169&gn=NBC.com Front Door&c2=&c3=Miscellaneous&c4=&c6=m.nbc.com/show/hro&c8=TV Entertainment&c9=NBC Network&c10=&c11= | &c12= | &c25=offdeck&c27=internal&c29=&c44=D=User-Agent&r=" width="5" height="5" border="0" /></center>
<h1 id="fHeader">
<a  href="/?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/nbc_logo.gif" alt="NBC : logo" border="0" />
</a>
</h1>

<h2>
<a  href="/show/hro?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/shows/1221684699_Heroes_WAP_166x54.jpg" alt="Heroes : showheader" border="0" />
</a>
</h2>
<div id="tunein_nexton">
    <span id="tunein">Mondays 9/8c</span>
</div><!--end #tunein_nexton-->
<div id="tunein_nexton">
    <!--<span id="tunein">Mondays 8/7c</span>-->

    <p id="nexton"><span class="sectiontitle"></span></p>
</div><!--end #tunein_nexton-->
<div id="featuredcontent">
    <h3>FEATURED CONTENT</h3>
    <table id="featuredItemsTable">

        <tr>
            <td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="/images/hro/nbc_hro_pro_040X921HRO120FLYPSIDE_exp921_20090_543_large.jpg" alt="featured" /></a>
            </td>
            <td>
                <span class="ftitle">Dreams</span>
                <span class="fdesc">Heroes premieres Mon., Sept. 21s...</span>
            </td>
        </tr>
                                        <tr>
            <td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/Heroes/images/episodes/season3/325/hro_325_01.jpg" alt="featured" height="45" width="80"/></a>
            </td>
            <td>
                <span class="ftitle">Recap:</span>
                <span class="fdesc">Season 3 Episode An Invisible Thread</span>
            </td>
        </tr>
                                        <tr>
            <td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/app2/img/200x200xS/scet/photos/51/3736/NUP_110031_0323.JPG" alt="featured" height="45" width="80"/></a>
            </td>
            <td class="finfo">
                <span class="ftitle">Photo:</span>
                <span class="fdesc">Heroes "Cast Photos"</span>
            </td>
        </tr>
                    </table>


</div><!--end #featuredcontent-->

<h3>HEROES</h3>
<table class="showNav">
    <tr><td><a  href="/show/hro/about.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="1">About</a></td></tr>
        <tr><td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="2">Videos</a></td></tr>
                <tr><td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="3">Episode Recaps</a></td></tr>
                    <tr><td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="4">Photos</a></td></tr>
                <tr><td><a  href="/show/hro/community.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="5">Community</a></td></tr>
    <tr><td><a  href="/shows.shtml?sid=8a9212f822e1c675330ec418bc531169" accesskey="6">Shows List</a></td></tr>
</table>
<!-- <a  href="http://www.insightexpress.com/ix/Survey.aspx?id=151580&accessCode=3161643404&sid=8a9212f822e1c675330ec418bc531169" ><img src="/images/mNBCcom_166x54.jpg" border="0"></a> -->



<div class="footer" align="center"><a  href="http://m.nbc.com?sid=8a9212f822e1c675330ec418bc531169"><strong>NBC Mobile Main</strong></a> | <a  href="/terms.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Terms of Use</strong></a> | <a  href="/privacy.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Privacy</strong></a></div><div class="cpyrt" align="center">&#169; NBC Universal, Inc.</div>

</body>
</html>

Any idea how to download the PDF?

SOLUTION

Set User-Agent before connecting.

URL u = new URL(urlString); 
HttpURLConnection huc =  (HttpURLConnection)  u.openConnection();
huc.setRequestMethod("GET"); 
huc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
huc.connect();          

in = new BufferedInputStream(huc.getInputStream());
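For anyone who wants the whole method rather than the fragment, here is a rough sketch of how the snippet above could be folded back into saveUrl. The class name UserAgentDownload, the helper open and the constant USER_AGENT are names I made up for illustration; the header value is based on the one in the snippet. It uses URI.create(...).toURL() in place of new URL(...), which is just a personal choice.

```java
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;

class UserAgentDownload {
    // Browser-like header value based on the snippet above.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) "
        + "Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)";

    // Prepares the request; nothing is sent until the response stream is opened.
    static HttpURLConnection open(String urlString) throws IOException {
        HttpURLConnection huc =
            (HttpURLConnection) URI.create(urlString).toURL().openConnection();
        huc.setRequestMethod("GET");
        huc.setRequestProperty("User-Agent", USER_AGENT);
        return huc;
    }

    // Copies the response body to the given file and closes both streams.
    static void saveUrl(String filename, String urlString) throws IOException {
        HttpURLConnection huc = open(urlString);
        try (InputStream in = new BufferedInputStream(huc.getInputStream());
             OutputStream fout = new FileOutputStream(filename)) {
            byte[] data = new byte[1024];
            int count;
            while ((count = in.read(data, 0, data.length)) != -1) {
                fout.write(data, 0, count);
            }
        } finally {
            huc.disconnect();
        }
    }
}
```

Calling UserAgentDownload.saveUrl(FOLDER + filename, URL + filename) from the original main would then replace the old saveUrl call.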

Upvotes: 3

Views: 5066

Answers (5)

Hari Gudigundla

Reputation: 822

If setting the User-Agent doesn't solve it, the problem could be cookies. Install a simple browser plugin (EditThisCookie or HTTP Spy for Chrome) and check the request and response headers. Grab the cookie values and set them on the same HttpURLConnection.

Code: (Extension to the SOLUTION posted by Sergio del Amo)

URL u = new URL(urlString); 
HttpURLConnection huc =  (HttpURLConnection)  u.openConnection();
huc.setRequestMethod("GET"); 
huc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");

String myCookies = "cookie_name_1=cookie_value_1; cookie_name_2=cookie_value_2";
huc.setRequestProperty("Cookie", myCookies);

huc.connect();          

in = new BufferedInputStream(huc.getInputStream());
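If there are more than a couple of cookies, building the header string by hand gets error-prone. A small helper along these lines might help; CookieHeader and build are hypothetical names, and the cookie names and values are the same placeholders as above, not real NBC cookies.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class CookieHeader {
    // Joins name=value pairs with "; ", the separator used in a Cookie request header.
    static String build(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> cookie : cookies.entrySet()) {
            joiner.add(cookie.getKey() + "=" + cookie.getValue());
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Placeholder names and values; copy the real ones from the browser.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("cookie_name_1", "cookie_value_1");
        cookies.put("cookie_name_2", "cookie_value_2");
        System.out.println(build(cookies));
        // prints: cookie_name_1=cookie_value_1; cookie_name_2=cookie_value_2
    }
}
```

The result would be passed to huc.setRequestProperty("Cookie", ...) before connecting.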

Upvotes: 0

ZZ Coder

Reputation: 75456

This is the same issue as in your other question. NBC.com doesn't send the PDF back to you if it thinks you are a scraper :)

The same trick will do:

conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");

Upvotes: 1

Vinay Sajip

Reputation: 99297

For this kind of exploration, I highly recommend Jython (or Groovy, or ...). For example:

C:\Users\Vinay>jython
Jython 2.5.0 (Release_2_5_0:6476, Jun 16 2009, 13:33:26)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_16
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
>>> import java.net
>>> import jarray
>>> u = java.net.URL(s)
>>> os = u.openStream()
>>> buffer = jarray.zeros(1024, 'b')
>>> n = os.read(buffer, 0, 1024)
>>> java.lang.String(buffer)
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">
<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css"  href="/style/default.css?sid=c67ddc30f79
ec4cc811f6e67e383fed7" />
<link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=c67ddc30f79ec4c
c811f6e67e383fed7" />

</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/
5/H.8--WAP/4aa0e7ce2535c?vid=c67ddc30f79ec4cc811f6e67e383fed7&gn=NBC.com Front
>>>

which confirms what you found, but without edit/compile cycles to get in the way. Just my 2 cents...

As for how to get the data - it may be that you have to spoof your User-Agent header. From Firefox, the same URL returns a Content-Type of application/pdf, and the PDF file.

Update: The following Jython script:

import java.net
import jarray

s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
u = java.net.URL(s)
c = u.openConnection()
c.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090810 Ubuntu/9.10 (karmic) Firefox/3.5.2")
BUFLEN = 4
buffer = jarray.zeros(BUFLEN, 'b')
c.connect()
stream = c.getInputStream()
stream.read(buffer, 0, BUFLEN)
data = java.lang.String(buffer)
print data

prints

%PDF

so the site is looking at the User-Agent header.
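The same four-byte check could be done in Java before writing anything to disk. This is only a sketch: PdfSignature and startsWithPdfHeader are names I invented, and it assumes the first read returned at least four bytes.

```java
import java.nio.charset.StandardCharsets;

class PdfSignature {
    // A PDF file begins with the ASCII bytes "%PDF".
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // True if the first `length` valid bytes of `head` begin with "%PDF".
    static boolean startsWithPdfHeader(byte[] head, int length) {
        if (head == null || length < MAGIC.length || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In the download loop, the first buffer and its count could be passed to startsWithPdfHeader, and the download aborted if it returns false, which would catch the HTML page early.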

Upvotes: 1

McDowell

Reputation: 108859

Inspect the resultant file; I expect it is an HTML file. The site probably returns an error if there is no referrer, or uses a JavaScript redirect page or something. You can use the HttpURLConnection class to check the HTTP headers returned by the server.

// Needs java.net.HttpURLConnection, java.net.URL, java.util.List and java.util.Map.
URL url = new URL(
    "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("HEAD");
try {
  for (Map.Entry<String, List<String>> header : conn.getHeaderFields()
      .entrySet()) {
    System.out.println(header.getKey() + "=" + header.getValue());
  }
} finally {
  conn.disconnect();
}

The above code returns a Content-Type of text/html.
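To act on that header rather than just print it, something like the following could be used. ContentTypeCheck and isPdf are hypothetical names; the only assumption is that a real PDF response is served as application/pdf, possibly with parameters after a semicolon.

```java
import java.util.Locale;

class ContentTypeCheck {
    // True when the media type is application/pdf, ignoring case and any parameters.
    static boolean isPdf(String contentType) {
        if (contentType == null) {
            return false; // header missing
        }
        int semicolon = contentType.indexOf(';');
        String mediaType =
            semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("application/pdf");
    }
}
```

With the connection above, ContentTypeCheck.isPdf(conn.getContentType()) would be false for the text/html response, so the program could stop before saving a bogus file.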

Upvotes: 1

Jesper

Reputation: 206776

Have you tried looking inside the downloaded file with, for example, a text editor?

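You'll see that it contains an HTML page, not a PDF. Probably the URL does not point to the PDF, or there is some redirecting going on that the standard java.net classes don't handle in every case (HttpURLConnection follows redirects by default, but not from one protocol to another, such as HTTP to HTTPS).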

Make sure the URL correctly points to the PDF. You could use Apache HttpClient for doing more sophisticated things with HTTP, including automatically handling HTTP redirects.
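As an alternative to the Apache library, Java 11 and later ship their own java.net.http.HttpClient, which can be told to follow redirects. The sketch below uses that JDK client, not Apache HttpClient; RedirectingDownload and its method names are made up, and the User-Agent value is whatever browser-like string you choose to pass in.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

class RedirectingDownload {
    // Client that follows redirects, except from HTTPS down to HTTP.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
    }

    // GET request carrying the caller-supplied User-Agent header.
    static HttpRequest newRequest(String urlString, String userAgent) {
        return HttpRequest.newBuilder(URI.create(urlString))
            .header("User-Agent", userAgent)
            .GET()
            .build();
    }

    // Sends the request and streams the response body into the target file.
    static Path save(String urlString, String userAgent, Path target)
            throws IOException, InterruptedException {
        HttpResponse<Path> response = newClient().send(
            newRequest(urlString, userAgent),
            HttpResponse.BodyHandlers.ofFile(target));
        return response.body();
    }
}
```

It would still be worth checking the response status and Content-Type before trusting the saved file.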

Note: The code you posted does not compile, because you placed a } wrongly.

Upvotes: 4
