Reputation: 78096
I have read the excellent discussion about How to download and save a file from the internet using Java. However, if I execute the following code, I get a corrupt PDF. Any idea why?
import java.io.*;
import java.net.*;

public class PDFDownload {

    public static String URL = "http://www.nbc.com/Heroes/novels/downloads/";
    public static String FOLDER = "C:/Users/sdelamo/workspace/SandBox/HeroesNovel/";

    public static void main(String[] args) {
        String filename = "Heroes_novel_001.pdf";
        try {
            saveUrl(FOLDER + filename, URL + filename);
        } catch (MalformedURLException e) {
            System.out.println("MalformedURLException");
        } catch (IOException e) {
            System.out.println("IOException");
        }
    }

    public static void saveUrl(String filename, String urlString) throws MalformedURLException, IOException {
        BufferedInputStream in = null;
        FileOutputStream fout = null;
        try {
            URL url = new URL(urlString);
            in = new BufferedInputStream(url.openStream());
            fout = new FileOutputStream(filename);
            byte[] data = new byte[1024];
            int count;
            while ((count = in.read(data, 0, 1024)) != -1) {
                fout.write(data, 0, count);
            }
        } finally {
            if (in != null)
                in.close();
            if (fout != null)
                fout.close();
        }
    }
}
The above code downloads html instead of a PDF. This is the output:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">
<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css" href="/style/default.css?sid=8a9212f822e1c675330ec418bc531169" />
<link rel="stylesheet" type="text/css" href="/style/hro.css?sid=8a9212f822e1c675330ec418bc531169" />
</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/5/H.8--WAP/4aa0e4cb8b448?vid=8a9212f822e1c675330ec418bc531169&gn=NBC.com Front Door&c2=&c3=Miscellaneous&c4=&c6=m.nbc.com/show/hro&c8=TV Entertainment&c9=NBC Network&c10=&c11= | &c12= | &c25=offdeck&c27=internal&c29=&c44=D=User-Agent&r=" width="5" height="5" border="0" /></center>
<h1 id="fHeader">
<a href="/?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/nbc_logo.gif" alt="NBC : logo" border="0" />
</a>
</h1>
<h2>
<a href="/show/hro?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/shows/1221684699_Heroes_WAP_166x54.jpg" alt="Heroes : showheader" border="0" />
</a>
</h2>
<div id="tunein_nexton">
<span id="tunein">Mondays 9/8c</span>
</div><!--end #tunein_nexton-->
<div id="tunein_nexton">
<!--<span id="tunein">Mondays 8/7c</span>-->
<p id="nexton"><span class="sectiontitle"></span></p>
</div><!--end #tunein_nexton-->
<div id="featuredcontent">
<h3>FEATURED CONTENT</h3>
<table id="featuredItemsTable">
<tr>
<td><a href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="/images/hro/nbc_hro_pro_040X921HRO120FLYPSIDE_exp921_20090_543_large.jpg" alt="featured" /></a>
</td>
<td>
<span class="ftitle">Dreams</span>
<span class="fdesc">Heroes premieres Mon., Sept. 21s...</span>
</td>
</tr>
<tr>
<td><a href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/Heroes/images/episodes/season3/325/hro_325_01.jpg" alt="featured" height="45" width="80"/></a>
</td>
<td>
<span class="ftitle">Recap:</span>
<span class="fdesc">Season 3 Episode An Invisible Thread</span>
</td>
</tr>
<tr>
<td><a href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/app2/img/200x200xS/scet/photos/51/3736/NUP_110031_0323.JPG" alt="featured" height="45" width="80"/></a>
</td>
<td class="finfo">
<span class="ftitle">Photo:</span>
<span class="fdesc">Heroes "Cast Photos"</span>
</td>
</tr>
</table>
</div><!--end #featuredcontent-->
<h3>HEROES</h3>
<table class="showNav">
<tr><td><a href="/show/hro/about.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="1">About</a></td></tr>
<tr><td><a href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="2">Videos</a></td></tr>
<tr><td><a href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="3">Episode Recaps</a></td></tr>
<tr><td><a href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="4">Photos</a></td></tr>
<tr><td><a href="/show/hro/community.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="5">Community</a></td></tr>
<tr><td><a href="/shows.shtml?sid=8a9212f822e1c675330ec418bc531169" accesskey="6">Shows List</a></td></tr>
</table>
<!-- <a href="http://www.insightexpress.com/ix/Survey.aspx?id=151580&accessCode=3161643404&sid=8a9212f822e1c675330ec418bc531169" ><img src="/images/mNBCcom_166x54.jpg" border="0"></a> -->
<div class="footer" align="center"><a href="http://m.nbc.com?sid=8a9212f822e1c675330ec418bc531169"><strong>NBC Mobile Main</strong></a> | <a href="/terms.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Terms of Use</strong></a> | <a href="/privacy.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Privacy</strong></a></div><div class="cpyrt" align="center">© NBC Universal, Inc.</div>
</body>
</html>
Any idea how to download the PDF?
SOLUTION
Set the User-Agent request header before connecting:
URL u = new URL(urlString);
HttpURLConnection huc = (HttpURLConnection) u.openConnection();
huc.setRequestMethod("GET");
huc.setRequestProperty("User-Agent", " Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
huc.connect();
in = new BufferedInputStream(huc.getInputStream());
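Putting the accepted fix together with the original saveUrl, a complete sketch could look like the following (the class name PdfFetcher and the open helper are mine, not from the original post; any mainstream browser User-Agent string should work):

```java
import java.io.BufferedInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PdfFetcher {

    // A browser-like User-Agent; the exact value is not important.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2";

    // Prepares the connection with the User-Agent set; nothing is sent yet.
    public static HttpURLConnection open(String urlString) throws IOException {
        HttpURLConnection huc = (HttpURLConnection) new URL(urlString).openConnection();
        huc.setRequestMethod("GET");
        huc.setRequestProperty("User-Agent", USER_AGENT);
        return huc;
    }

    public static void saveUrl(String filename, String urlString) throws IOException {
        HttpURLConnection huc = open(urlString);
        huc.connect();
        InputStream in = new BufferedInputStream(huc.getInputStream());
        FileOutputStream fout = new FileOutputStream(filename);
        try {
            byte[] data = new byte[1024];
            int count;
            while ((count = in.read(data)) != -1) {
                fout.write(data, 0, count);
            }
        } finally {
            in.close();
            fout.close();
            huc.disconnect();
        }
    }
}
```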
Upvotes: 3
Views: 5066
Reputation: 822
If setting the User-Agent didn't solve it, the problem could be cookies. Install a simple browser plugin (EditThisCookie or HTTP Spy for Chrome) and check the request and response headers. Grab the cookie values and set them on the same HttpURLConnection.
Code: (Extension to the SOLUTION posted by Sergio del Amo)
URL u = new URL(urlString);
HttpURLConnection huc = (HttpURLConnection) u.openConnection();
huc.setRequestMethod("GET");
huc.setRequestProperty("User-Agent", " Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
String myCookies = "cookie_name_1=cookie_value_1;cookie_name_2=cookie_value_2";
huc.setRequestProperty("Cookie", myCookies);
huc.connect();
in = new BufferedInputStream(huc.getInputStream());
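Rather than copying cookie pairs by hand out of a browser plugin, the stdlib java.net.HttpCookie class can parse Set-Cookie response headers for you. A sketch (CookieHelper and cookieHeader are hypothetical names, not part of any answer here):

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;

public class CookieHelper {

    // Builds a "Cookie" request-header value from the server's
    // Set-Cookie response headers (e.g. from conn.getHeaderFields()).
    public static String cookieHeader(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<String>();
        for (String value : setCookieValues) {
            // HttpCookie.parse handles attributes like Path and HttpOnly.
            for (HttpCookie c : HttpCookie.parse(value)) {
                pairs.add(c.getName() + "=" + c.getValue());
            }
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pairs.size(); i++) {
            if (i > 0) sb.append("; ");
            sb.append(pairs.get(i));
        }
        return sb.toString();
    }
}
```

The resulting string can be passed straight to huc.setRequestProperty("Cookie", ...) as in the snippet above.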
Upvotes: 0
Reputation: 75456
This is the same issue as in your other question. NBC.com doesn't send back the PDF if it thinks you are a scraper :)
The same trick will do:
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");
Upvotes: 1
Reputation: 99297
For this kind of exploration, I highly recommend Jython (or Groovy, or ...). For example:
C:\Users\Vinay>jython
Jython 2.5.0 (Release_2_5_0:6476, Jun 16 2009, 13:33:26)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_16
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
>>> import java.net
>>> import jarray
>>> u = java.net.URL(s)
>>> os = u.openStream()
>>> buffer = jarray.zeros(1024, 'b')
>>> n = os.read(buffer, 0, 1024)
>>> java.lang.String(buffer)
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
"http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">
<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css" href="/style/default.css?sid=c67ddc30f79ec4cc811f6e67e383fed7" />
<link rel="stylesheet" type="text/css" href="/style/hro.css?sid=c67ddc30f79ec4cc811f6e67e383fed7" />
</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/5/H.8--WAP/4aa0e7ce2535c?vid=c67ddc30f79ec4cc811f6e67e383fed7&gn=NBC.com Front
>>>
which confirms what you found, but without edit/compile cycles to get in the way. Just my 2 cents...
As for how to get the data - it may be that you have to spoof your User-Agent header. From Firefox, the same URL returns a Content-Type of application/pdf, and the PDF file.
Update: The following Jython script:
import java.net
import jarray
s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
u = java.net.URL(s)
c = u.openConnection()
c.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090810 Ubuntu/9.10 (karmic) Firefox/3.5.2")
BUFLEN = 4
buffer = jarray.zeros(BUFLEN, 'b')
c.connect()
stream = c.getInputStream()
stream.read(buffer, 0, BUFLEN)
data = java.lang.String(buffer)
print data
prints %PDF (the first four bytes of a valid PDF), so the site is looking at the User-Agent header.
Upvotes: 1
Reputation: 108859
Inspect the resulting file - I expect it is an HTML file. The site probably returns an error page if there is no referrer, or uses a JavaScript redirect page or something similar. You can use the HttpURLConnection class to check the HTTP headers returned by the server.
URL url = new URL("http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("HEAD");
try {
    for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
        System.out.println(header.getKey() + "=" + header.getValue());
    }
} finally {
    conn.disconnect();
}
The above code returns a Content-Type of text/html.
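Beyond checking the response headers, you can also sanity-check the downloaded bytes themselves: every valid PDF begins with the four ASCII bytes %PDF. A small sketch (PdfCheck is a hypothetical helper, not from this answer):

```java
public class PdfCheck {

    // A PDF file always starts with the ASCII magic sequence "%PDF".
    // An HTML error page will typically start with "<" instead.
    public static boolean looksLikePdf(byte[] head) {
        return head.length >= 4
                && head[0] == '%' && head[1] == 'P'
                && head[2] == 'D' && head[3] == 'F';
    }
}
```

Reading the first few bytes of the saved file and running them through such a check makes the "HTML instead of PDF" failure obvious immediately.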
Upvotes: 1
Reputation: 206776
Have you tried looking inside the downloaded file with, for example, a text editor? You'll see that it contains an HTML page, not a PDF. Probably the URL does not point to the PDF, or there is some redirecting going on, which the standard java.net classes don't support by default.
Make sure the URL correctly points to the PDF. You could use Apache HttpClient for doing more sophisticated things with HTTP, including automatically handling HTTP redirects.
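For what it's worth, plain HttpURLConnection can follow ordinary HTTP redirects on its own (it will not, however, hop between http and https), so HttpClient is only needed for more elaborate cases. A sketch, with a made-up URL and class name:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class RedirectDemo {

    public static HttpURLConnection openFollowing(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        // Follow 3xx responses automatically for this connection.
        // This is on by default, but making it explicit documents the intent.
        conn.setInstanceFollowRedirects(true);
        return conn;
    }
}
```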
Note: The code you posted does not compile, because you placed a } wrongly.
Upvotes: 4