Reputation: 21
I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using
wget -r -np -nd -e robots=off -l 0 URL
but this also pulls in roughly 60,000 extra files that I waste time downloading and then deleting.
I really only want files with the extensions "tbt" and "zip". But when I add -A tbt,zip to the command, wget downloads only a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the specified file types, and then the process stops entirely, with wget announcing that it is finished. It never attempts to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
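For reference, the full command with the accept list added looks like this (URL is a stand-in for the actual site, as above):
wget -r -np -nd -e robots=off -l 0 -A tbt,zip URL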
Upvotes: 2
Views: 1612
Reputation: 1742
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match both the linked file and the downloaded file, so my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx links in the webpage; then, when wget does its second check, all the mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the downloaded file from filedownload.ashx?name=file.mp3 to just file.mp3.
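Putting that together, an invocation along these lines should work (untested here; the URL is just a placeholder for the page with the download links, and you'd add whatever other recursion options you need):
wget -r -A 'ashx,mp3' --trust-server-names http://example.com/downloads/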
Upvotes: 0
Reputation: 312530
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a link refers to a file or a directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
<a href="files">Lots o' files!</a>
If I were to request this with wget -r, then wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that it looks like...
<a href="files/">Lots o' files!</a>
...then wget will follow the link and download files as desired.
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; the output from adding --debug to your command line would clarify things in that case.
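For example, something along these lines (with your actual URL substituted) would capture the accept/reject decisions wget is making:
wget -r -np -nd -e robots=off -l 0 -A tbt,zip --debug URL 2>&1 | tee wget-debug.log
You can then search the log for lines like "does not match acc/rej rules" to see which links are being dropped.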
Upvotes: 1