InsaneCoder

Reputation: 8268

Making wget bypass the index.html file

I am trying to download all the images from this link. Since I want images from only the hydraulics section, I used --no-parent, but when I run the command

wget -r --no-parent -e robots=off --user-agent="Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0" -A png http://indiabix.com/civil-engineering/hydraulics/

it only downloads the index.html.

I searched for this issue on the web, and Stack Overflow already has two questions about it, but they do not help. I also started a bounty on the latter question, but I wonder if anyone can suggest a workaround in my case?

Upvotes: 1

Views: 2414

Answers (2)

Alf Eaton

Reputation: 5463

The answer depends on knowing the path to the images folder, so that it can be added to the list of directories to include (without the --include parameter, the whole site will be fetched).

wget 'http://indiabix.com/civil-engineering/hydraulics/' --convert-links --adjust-extension --recursive --page-requisites --no-directories --directory-prefix=output --include '/civil-engineering/hydraulics','/_files/images'

Upvotes: 0

Axel Amthor

Reputation: 11096

Quite simple:

  • There are no images on the link you provided.

The tiny icons ("View Answer" etc.) are part of a CSS definition for the anchor (background-image). As of now, wget will not parse external CSS and pick up images from there.
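Since wget won't follow image URLs that appear only inside CSS rules, one workaround is to extract them by hand. A minimal sketch of that idea follows; the stylesheet content and the /_files/images/... paths are illustrative placeholders, not the site's actual files:

```shell
# Sketch: pull background-image URLs out of a locally saved stylesheet.
# Assumes the CSS was already fetched, e.g. with:
#   wget -O style.css http://indiabix.com/path/to/style.css   (path hypothetical)
css_file=style.css

# Sample stylesheet standing in for the real one (illustrative content)
cat > "$css_file" <<'EOF'
a.view-answer { background-image: url(/_files/images/view-answer.png); }
a.discuss     { background-image: url("/_files/images/discuss.png"); }
EOF

# Extract the paths inside url(...), stripping any surrounding quotes
grep -o "url([^)]*)" "$css_file" \
  | sed -e "s/^url(//" -e "s/)$//" -e "s/[\"']//g"
```

The printed paths could then be fed back to wget, e.g. by piping them into `wget -B http://indiabix.com -i -` (`-i -` reads URLs from stdin and `-B` supplies a base URL for relative paths).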

With -A png, wget will even discard the first file it fetches (the .html page), since its extension doesn't match the accept list.

I've succeeded in downloading everything with

   lwp-rget --hier --nospace http://indiabix.com/civil-engineering/hydraulics/

The LWP Perl packages from CPAN need to be installed (e.g., on openSUSE: zypper se libwww).

Upvotes: 1
