Reputation: 635
I am trying to download all PDFs from http://www.fayette-pva.com/.
I believe the problem is that the download links do not end in a .pdf file extension: when I hover over a link, Chrome shows the URL in the bottom left-hand corner without one. I found and used another forum answer for a similar problem, but in that case the URL shown on hover did end in .pdf. I have tried the same code as in the answer linked below, but it doesn't pick up the PDF files.
Here is the code I have been testing with:
wget --no-directories -e robots=off -A.pdf -r -l1 \
http://www.fayette-pva.com/sales-reports/salesreport03-feb-09feb2015/
I am running this against a single page that I know contains a PDF.
The complete command should be something like:
wget --no-directories -e robots=off -A.pdf -r http://www.fayette-pva.com/
Related answer: WGET problem downloading pdfs from website
I am not sure whether crawling the entire website would work, or whether it would take forever. How do I get around this and download only the PDFs?
Upvotes: 7
Views: 16699
Reputation: 10539
Yes, the problem is precisely what you stated: the URLs do not contain regular or absolute filenames, but are calls to a script, servlet, or similar, which hands out the actual files.
The solution is to use the --content-disposition option, which tells wget to honor the Content-Disposition field in the HTTP response, which carries the actual filename:
HTTP/1.1 200 OK
(...)
Content-Disposition: attachment; filename="SalesIndexThru09Feb2015.pdf"
(...)
Connection: close
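To check whether a particular link really sends this header before starting a recursive download, you can look at the response headers first. The following is only a sketch: the function name cd_filename and the variable PDF_LINK_URL are made up for illustration, and it handles only the plain filename= form, not the encoded filename*= variant.

```shell
# Hypothetical helper: read HTTP response headers on stdin and print the
# filename announced in the Content-Disposition header, if there is one.
cd_filename() {
  tr -d '\r' |                                      # strip HTTP carriage returns
    grep -i '^[[:space:]]*content-disposition:' |   # header name is case-insensitive
    sed -n 's/.*filename=//p' |                     # keep what follows filename=
    tr -d '"'                                       # drop surrounding quotes
}

# Usage (needs network access; PDF_LINK_URL is a placeholder for one of the
# download links on the page). wget prints the headers on stderr:
#   wget --server-response --spider "$PDF_LINK_URL" 2>&1 | cd_filename
```

If this prints nothing, the server may simply not send the header for the kind of request that --spider makes, so an empty result is not proof that --content-disposition will fail.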
This option is supported in wget at least since version 1.11.4, which is already 7 years old.
So you would do the following:
wget --no-directories --content-disposition -e robots=off -A.pdf -r \
http://www.fayette-pva.com/
Upvotes: 12