Jake McGraw

Reputation: 56126

Scrape multi-frame website

I'm auditing our existing web application, which makes heavy use of HTML frames. I would like to download all of the HTML in each frame, is there a method of doing this with wget or a little bit of scripting?

Upvotes: 3

Views: 5574

Answers (3)

Zebra North

Reputation: 11492

wget has a -r option to make it recursive. Try wget -r -l1 (in case the font makes it hard to read: that last part is a lowercase L followed by the number one). The -l1 option tells wget to recurse to a maximum depth of 1; increase this number to scrape more deeply.

Upvotes: 1

JustinD

Reputation: 1676

As an addition to Steve's answer:

Span to any host—‘-H’

The ‘-H’ option turns on host spanning, thus allowing Wget's recursive run to visit any host referenced by a link. Unless sufficient recursion-limiting criteria are applied (recursion depth being a common one), these foreign hosts will typically link to yet more hosts, and so on until Wget ends up sucking up much more data than you have intended.

Limit spanning to certain domains—‘-D’

The ‘-D’ option allows you to specify the domains that will be followed, thus limiting the recursion only to the hosts that belong to these domains. Obviously, this makes sense only in conjunction with ‘-H’.

A typical example would be downloading the contents of ‘www.server.com’, but allowing downloads from ‘images.server.com’, etc.:

      wget -rH -Dserver.com http://www.server.com/

You can specify more than one address by separating them with a comma,

e.g. ‘-Ddomain1.com,domain2.com’.

taken from: wget manual

Upvotes: 6

Steve Moyer

Reputation: 5733

wget --recursive --domains=www.mysite.com http://www.mysite.com

This tells wget to perform a recursive crawl, which also traverses into frames and iframes. Be careful to limit the scope of the recursion to your own site, since you probably don't want to crawl the whole web.
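On the "little bit of scripting" side of the question, frame URLs can also be pulled out of an already-downloaded page and fed back to wget one at a time. A minimal sketch (the sample frameset and its filenames are made up for illustration):

```shell
# Sample page standing in for a downloaded frameset (contents are made up).
cat > index.html <<'EOF'
<frameset cols="20%,80%">
  <frame src="nav.html">
  <frame src="main.html">
</frameset>
EOF

# Pull the src attribute out of each frame tag; each resulting URL could
# then be handed to its own wget invocation.
srcs=$(grep -o 'src="[^"]*"' index.html | sed 's/^src="//; s/"$//')
echo "$srcs"
```

This regex-based extraction is only a rough sketch; it assumes simple, well-formed markup with double-quoted src attributes, which is usually fine for a quick audit but not for arbitrary HTML.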

Upvotes: 1
