Butt Head

Reputation: 21

What am I screwing up trying to download particular file types with wget?

I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.

I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL, but this leaves me with about 60,000 extra files that waste time to both download and delete.

I am really only looking for files with the extensions "tbt" and "zip". When I add -A tbt,zip to the command, wget downloads only a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the specified file types, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.

What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?

Upvotes: 2

Views: 1612

Answers (2)

makeworld

Reputation: 1742

I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match both the linked file and the downloaded file, so my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx files that are linked on the webpage; then, when wget does its second check on each downloaded file, all the mp3 files that were saved will stay.

As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.
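Applied to the question above, where the wanted extensions are tbt and zip, the combined command would look something like the following (an untested sketch; URL is a placeholder for the actual site, and --trust-server-names only changes anything if the site serves files through indirect links):

```shell
# Sketch: the asker's recursive download restricted to .tbt and .zip.
# $URL is a placeholder; --trust-server-names is harmless if the links
# are direct, and helps if they go through a download script.
wget -r -np -nd -e robots=off -l 0 \
     -A 'tbt,zip' \
     --trust-server-names \
     "$URL"
```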

Upvotes: 0

larsks

Reputation: 312530

Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a link refers to a file or a directory based on whether or not it ends with a /.

For example, say I have a directory named files, and a web page that has:

<a href="files">Lots o' files!</a>

If I were to request this with wget -r, then wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.

However, if I add -A zip to my command line, and run wget with --debug, I see:

appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.

In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.

If I modify the remote file so that it looks like...

<a href="files/">Lots o' files!</a>

...then wget will follow the link and download files as desired.
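The whole experiment can be reproduced locally with a throwaway HTTP server (a sketch assuming python3 and wget are on the PATH; the /tmp paths and port 8123 are arbitrary choices for the demo):

```shell
# Serve a directory tree containing one zip file
rm -rf /tmp/slashdemo && mkdir -p /tmp/slashdemo/site/files
echo data > /tmp/slashdemo/site/files/a.zip
cd /tmp/slashdemo/site
python3 -m http.server 8123 >/dev/null 2>&1 &
SERVER_PID=$!
sleep 1

# Case 1: no trailing slash. "files" is treated as a file, fails the
# -A zip check, and is never followed.
printf '<a href="files">Lots o files!</a>\n' > index.html
mkdir /tmp/slashdemo/run1 && cd /tmp/slashdemo/run1
wget -q -r -np -nd -A zip http://localhost:8123/ || true
ls   # empty: a.zip was never reached

# Case 2: trailing slash. "files/" is treated as a directory, wget
# follows it and keeps the matching zip file.
printf '<a href="files/">Lots o files!</a>\n' > /tmp/slashdemo/site/index.html
mkdir /tmp/slashdemo/run2 && cd /tmp/slashdemo/run2
wget -q -r -np -nd -A zip http://localhost:8123/ || true
ls   # a.zip

kill $SERVER_PID
```

In both runs the HTML pages are downloaded temporarily for link extraction and then deleted because they don't match -A zip; only the treatment of the slash-less link differs.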

I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.

It's also possible you're experiencing a different issue; the output from adding --debug to your command line would clarify things in that case.

Upvotes: 1
