user18101

Reputation: 635

wget downloading only PDFs from website

I am trying to download all PDFs from http://www.fayette-pva.com/.

I believe the problem is that when I hover over a link to one of the PDFs, Chrome shows a URL in the bottom left-hand corner that has no .pdf file extension. I found and followed a similar forum answer, but in that case the URLs did end in .pdf when hovering over the PDF links. I have tried the same code from the answer linked below, but it doesn't pick up the PDF files.

Here is the code I have been testing with:

wget --no-directories -e robots=off -A.pdf -r -l1 \
    http://www.fayette-pva.com/sales-reports/salesreport03-feb-09feb2015/

I am using this on a single page of which I know that it has a PDF on it.

The complete code should be something like

wget --no-directories -e robots=off -A.pdf -r http://www.fayette-pva.com/

Related answer: WGET problem downloading pdfs from website

I am not sure whether downloading the entire website would even work, or whether it would take forever. How do I get around this and download only the PDFs?

Upvotes: 7

Views: 16699

Answers (1)

zb226

Reputation: 10539

Yes, the problem is precisely what you stated: The URLs do not contain regular or absolute filenames, but are calls to a script/servlet/... which hands out the actual files.

The solution is to use the --content-disposition option, which tells wget to honor the Content-Disposition field in the HTTP response, which carries the actual filename:

HTTP/1.1 200 OK
(...)
Content-Disposition: attachment; filename="SalesIndexThru09Feb2015.pdf"
(...)
Connection: close
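
If you want to check that the server really sends this header for one of the links, you can ask wget to print the response headers without downloading anything. The URL below is just a placeholder; substitute one of the actual download links from the sales-report page, and look for a Content-Disposition line in the output (assuming the server includes it in the reply):

# Placeholder link for illustration; replace with a real download URL
wget --server-response --spider \
    "http://www.fayette-pva.com/sales-reports/salesreport03-feb-09feb2015/<report-link>"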

This option is supported in wget at least since version 1.11.4, which is already 7 years old.

So you would do the following:

wget --no-directories --content-disposition -e robots=off -A.pdf -r \
    http://www.fayette-pva.com/
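
If you are worried about crawling the whole site taking forever, you could also limit the recursion depth and add a small delay between requests. The depth and wait values below are only suggestions; increase -l if the pages containing the PDF links sit deeper in the site:

# -l 2 limits recursion depth (a guess), --wait=1 pauses 1 second between requests
wget --no-directories --content-disposition -e robots=off -A.pdf \
    -r -l 2 --wait=1 \
    http://www.fayette-pva.com/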

Upvotes: 12
