S. Sharp

Reputation: 43

wget recursive fails on wiki pages

I'm trying to recursively fetch all pages linked from a Moin wiki page. I've tried many different wget recursive options, which all have the same result: only the html file from the given URL gets downloaded, not any of the pages linked from that html page.

If I use the --convert-links option, wget correctly translates the unfetched links to the right web links. It just doesn't recursively download those linked pages.

wget --verbose -r https://wiki.gnome.org/Outreachy
--2017-03-02 10:34:03--  https://wiki.gnome.org/Outreachy
Resolving wiki.gnome.org (wiki.gnome.org)... 209.132.180.180, 209.132.180.168
Connecting to wiki.gnome.org (wiki.gnome.org)|209.132.180.180|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘wiki.gnome.org/Outreachy’

wiki.gnome.org/Outreachy                                      [  <=>                                                                                                                                ]  52.80K   170KB/s    in 0.3s    

2017-03-02 10:34:05 (170 KB/s) - ‘wiki.gnome.org/Outreachy’ saved [54064]

FINISHED --2017-03-02 10:34:05--
Total wall clock time: 1.4s
Downloaded: 1 files, 53K in 0.3s (170 KB/s)

I'm not sure if it's failing because the wiki's html links don't end with .html. I've tried various combinations of --accept='[a-zA-Z0-9]+', --page-requisites, and --accept-regex='[a-zA-Z0-9]+' to work around that, with no luck.
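For example, one of the combinations I tried looked roughly like this (the regex is only illustrative; --accept-regex is matched against the complete URL):

wget -r --page-requisites --accept-regex='[a-zA-Z0-9/]+' https://wiki.gnome.org/Outreachy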

I'm not sure if it's failing because the wiki has html pages like https://wiki.gnome.org/Outreachy that link to page URLs like https://wiki.gnome.org/Outreachy/Admin and https://wiki.gnome.org/Outreachy/Admin/GettingStarted. Maybe wget is confused because it would need to create both an HTML page and a directory with the same name? I also tried using -nd (--no-directories), but no luck.

The linked html pages are all relative to the base wiki URL (e.g. <a href="/Outreachy/History">Outreachy history page</a>). I've also tried adding --base="https://wiki.gnome.org/" with no luck.
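That attempt looked roughly like this (just a sketch of what I ran):

wget -r --base="https://wiki.gnome.org/" --convert-links https://wiki.gnome.org/Outreachy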

At this point, I've tried a whole lot of different wget options, read several Stack Overflow and unix.stackexchange.com questions, and nothing I've tried has worked. I'm hoping there's a wget expert who can look at this particular wiki page and figure out why wget is failing to recursively fetch linked pages. The same options work fine on other domains.

I've also tried httrack, with the same result. I'm running Linux, so please don't suggest Windows or proprietary tools.

Upvotes: 1

Views: 731

Answers (1)

ehabkost

Reputation: 421

This seems to be caused by the following tag in the wiki:

<meta name="robots" content="index,nofollow">

If you are sure you want to ignore the tag, you can tell wget to do so with -e robots=off:

wget -e robots=off --verbose -r https://wiki.gnome.org/Outreachy
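
If you also want the mirrored pages to be browsable offline, it should be possible to combine this with the options you already mentioned, along the lines of:

wget -e robots=off -r --convert-links --page-requisites https://wiki.gnome.org/Outreachy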

Upvotes: 0
