Paul S.

Reputation: 4492

Wget span host only for images/stylesheets/javascript but not links

Wget has the -H "span host" option

Span to any host—‘-H’
The ‘-H’ option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied (such as a maximum depth), these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you have intended.

I want to do a recursive download (say, of level 3), and I want to get images, stylesheets, JavaScript, etc. (that is, the files necessary to display the page properly) even if they're outside my host. However, I don't want to follow links to other HTML pages (because then they can lead to yet more HTML pages, and the number can explode).

Is it possible to do this somehow? It seems like the -H option controls spanning to other hosts for both the images/stylesheets/javascript case and the link case, and wget doesn't allow me to separate the two.

Upvotes: 27

Views: 8137

Answers (5)

EliuX

Reputation: 12625

Just run wget -E -H -k -K -p -r http://<site>/ to download a complete site. Don't worry if, while the download is still running, you open some page and its resources are not available: once wget finishes everything, it will convert the links.
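
If you share the question's concern about -r wandering off to other sites, one hedged variation is to add a depth limit; the -l value below is just an example and not part of this answer:

wget -E -H -k -K -p -r -l 1 http://<site>/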

Upvotes: 3

user4401178

Reputation:

Try the wget --accept-regex flag. The POSIX engine (--regex-type posix) is compiled into wget by default, but you can compile in the Perl-compatible regex engine (PCRE) if you need something more elaborate:

For example, the following will get all PNGs on external sites one level deep, plus any other pages that have the word "google" in the URL:

wget -r -H -k -l 1 --regex-type posix --accept-regex "(.*google.*|.*png)" "http://www.google.com"

This doesn't actually solve the problem of going down multiple levels on external sites; for that you would probably have to write your own spider. But using --accept-regex you can probably get close to what you are looking for in most cases.
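
Applied to the question here, one could mirror that pattern: accept everything under your own site, plus URLs that look like typical page requisites anywhere else. This is only a sketch under the assumption that the requisites use these extensions; test it against a small site first, since accept rules may still let wget fetch some intermediate HTML for link extraction:

wget -r -H -k -l 3 --regex-type posix --accept-regex "(.*<site>.*|.*\.(png|jpe?g|gif|css|js))" "http://<site>/"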

Upvotes: 1

anEffingChamp

Reputation: 166

Within a single layer of a domain, you can check all links, both internal and on third-party servers, with the following command.

wget --spider -nd -e robots=off -Hprb --level=1 -o wget-log -nv http://localhost

The limitation here is that it only checks a single layer. This works well with a CMS where you can flatten the site with a GET variable rather than CMS-generated URLs. Otherwise you can use your favorite server-side script to loop this command through directories. For a full explanation of all of the options, check out this script on GitHub:

https://github.com/jonathan-smalls-cc/git-hooks/blob/LAMP/contrib/pre-commit/crawlDomain.sh
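
As a sketch of the "loop this command through directories" idea, assuming a plain-text file urls.txt with one start URL per line (the file name is an illustration, not part of the answer):

# -b backgrounds each wget run, so the crawls below proceed in parallel
while read -r url; do
    wget --spider -nd -e robots=off -Hprb --level=1 -o "wget-log-$(basename "$url")" -nv "$url"
done < urls.txt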

Upvotes: 0

lightswitch05

Reputation: 9428

Downloading All Dependencies in a page

The first step is downloading all the resources of a particular page. If you look in the man page for wget, you will find this:

...to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:

wget -E -H -k -K -p http://<site>/<document>

Getting Multiple Pages

Unfortunately, that only works per-page. You can turn on recursion with -r, but then you run into the issue of following external sites and blowing up. If you know the full list of domains that could be used for resources, you can limit it to just those using -D, but that might be hard to do. I recommend using a combination of -np (no parent directories) and -l to limit the depth of the recursion. You might start getting other sites, but it will at least be limited. If you start having issues, you could use --exclude-domains to limit the known problem causers. In the end, I think this is best:

wget -E -H -k -K -p -np -l 1 http://<site>/level

Limiting the domains

To help figure out what domains need to be included/excluded you could use this answer to grep a page or two (you would want to grep the .orig file) and list the links within them. From there you might be able to build a decent list of domains that should be included and limit it using the -D argument. Or you might at least find some domains that you don't want included and limit them using --exclude-domains. Finally, you can use the -Q argument to limit the amount of data downloaded as a safeguard to prevent filling up your disk.
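
For example, a quick way to list candidate hosts from a page you have already downloaded (the .orig file name below is only an illustration):

grep -oE 'https?://[^/"]+' <document>.html.orig | sort -u

The unique hosts it prints can then be trimmed down into a -D or --exclude-domains list.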

Descriptions of the Arguments

  • -E
    • If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename.
  • -H
    • Enable spanning across hosts when doing recursive retrieving.
  • -k
    • After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
  • -K
    • When converting a file, back up the original version with a .orig suffix.
  • -p
    • This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
  • -np
    • Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
  • -l
    • Specify recursion maximum depth level depth.
  • -D
    • Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
  • --exclude-domains
    • Specify the domains that are not to be followed.
  • -Q
    • Specify download quota for automatic retrievals. The value can be specified in bytes (default), kilobytes (with k suffix), or megabytes (with m suffix).
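
Putting several of these together, a sketch of the kind of command the above suggests (cdn.example.com and the 200m quota are placeholders, not taken from the answer; the starting site is kept in the -D list so its own pages stay eligible):

wget -E -H -k -K -p -np -l 1 -D <site>,cdn.example.com -Q 200m http://<site>/level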

Upvotes: 20

user829755

Reputation: 1588

For downloading all "files necessary to display the page properly" you can use -p (--page-requisites), perhaps together with -Q (--quota).
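
For example (a sketch; the quota value is arbitrary, and -H/-k are borrowed from the answers above):

wget -p -H -k --quota=10m http://<site>/<document>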

Upvotes: 1
