Reputation: 15
I have to download all of a site's content and then parse the downloaded folder for "*.pdf" files. I am downloading the site using wget -r --no-parent http://www.example.com/
But the problem is that sometimes a link looks like this:
http://www.foodmanufuture.eu/dpubs?f=K20
and the PDF is saved with the name "dpubs?f=K20", with no file extension, rather than "dpubs?f=K20.pdf". Is there a way to check how many PDF files I have in this folder?
Upvotes: 0
Views: 83
Reputation: 163
Have you tried the --content-disposition flag? From the man page:
If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default. This option is useful for some file-downloading CGI programs that use "Content-Disposition" headers to describe what the name of a downloaded file should be.
So it tries to ask the server for a filename. I tried it for the URL you gave and it seemed to work.
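As a sketch, combined with the recursive options from your question (assuming the same starting URL), that would look like:
wget -r --no-parent --content-disposition http://www.example.com/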
Upvotes: 1
Reputation: 61
You could use the command
file filename
Like this:
file pdfurl-guide
pdfurl-guide: PDF document, version 1.5
You could use:
file *
to know exactly which files in your folder are PDF files.
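And if you want an actual count rather than a listing, one possible sketch (assuming file and grep are available) is to pipe the output through grep:
file * | grep -c 'PDF document'
or, to cover subfolders from the recursive download:
find . -type f -exec file {} + | grep -c 'PDF document'
grep -c counts the lines that file reports as "PDF document", regardless of the file names.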
Upvotes: 0