Alexander Demichev

Reputation: 15

Wget file format

I need to download an entire site and then search the downloaded folder for "*.pdf" files. I am downloading the site with wget -r --no-parent http://www.example.com/, but the problem is that sometimes a link looks like this:

http://www.foodmanufuture.eu/dpubs?f=K20

and the downloaded PDF is saved under the name "dpubs?f=K20" with no file extension, rather than as "dpubs?f=K20.pdf". Is there a way to check how many PDF files I have in this folder?

Upvotes: 0

Views: 83

Answers (2)

CannibalGorilla

Reputation: 163

Have you tried the --content-disposition flag? From the man page:

If this is set to on, experimental (not fully-functional) support for "Content-Disposition" headers is enabled. This can currently result in extra round-trips to the server for a "HEAD" request, and is known to suffer from a few bugs, which is why it is not currently enabled by default. This option is useful for some file-downloading CGI programs that use "Content-Disposition" headers to describe what the name of a downloaded file should be.

So wget asks the server what the file should be called. I tried it on the URL you gave and it seemed to work.
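As a rough sketch, the flag can simply be added to the recursive download from the question. The wrapper name mirror_site is made up for illustration, and the URL is the placeholder from the question. Since the man page calls the feature experimental, treat this as something to try rather than a guaranteed fix.

```shell
# Hypothetical helper (name is illustrative): recursive download that lets
# the server suggest file names via Content-Disposition headers.
mirror_site() {
    # -r and --no-parent come from the question;
    # --content-disposition is the flag described above.
    wget -r --no-parent --content-disposition "$1"
}

# Example call, using the placeholder URL from the question:
# mirror_site http://www.example.com/
```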

Upvotes: 1

João Silva

Reputation: 61

You could use the file command, which identifies a file's type from its content rather than from its name:

file filename

Like this:

file pdfurl-guide
pdfurl-guide: PDF document, version 1.5

To see exactly which files in your folder are PDF files, run it on everything:

file *
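To get a count rather than a listing, one possible sketch is to ask file for the MIME type of every downloaded file and count the application/pdf lines. The function name count_pdfs is made up for illustration. It assumes a file implementation that supports -b and --mime-type (the common Unix one does), and it searches subdirectories too, since wget -r creates them.

```shell
# Hypothetical helper (name is illustrative): count the PDF files under a
# directory by content type, ignoring file names and extensions.
count_pdfs() {
    dir="${1:-.}"
    # -b suppresses the file name, --mime-type prints e.g. "application/pdf",
    # so there is one output line per file; grep -c counts the PDF ones.
    find "$dir" -type f -exec file -b --mime-type {} + \
        | grep -c '^application/pdf$'
}

# Example call on the folder wget created for the site:
# count_pdfs www.example.com
```

Dropping the grep -c and using file --mime-type without -b would instead list each file name next to its type, which is handy for renaming the extensionless downloads afterwards.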

Upvotes: 0
