M. A.

Reputation: 424

Use wget to crawl specific URLs

I am trying to crawl links from a website and then use a download manager to download the files.

I've tried:

wget --wait=20 --limit-rate=20K -r -p -U Mozilla "www.mywebsite.com"

I can't figure out how to use wget with regular expressions to save only the desired links!

Upvotes: 1

Views: 2046

Answers (1)

Stephan

Reputation: 43013

wget offers a wide variety of options for fine-tuning which files are downloaded during a recursive crawl.

Here are a few options that may interest you:

  • --accept-regex urlregex

Download any URL matching urlregex, a regular expression that is matched against the complete URL.

  • --reject-regex urlregex

Ignore any URL matching urlregex, a regular expression that is matched against the complete URL.

  • -L

Tells wget to follow relative links only (long form: --relative).

Relative links example:

<a href="foo.gif">
<a href="foo/bar.gif">
<a href="../foo/bar.gif">

Non-relative links:

<a href="/foo.gif">
<a href="/foo/bar.gif">
<a href="http://www.server.com/foo/bar.gif">
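Putting this together: since --accept-regex is matched against the complete URL, you can sanity-check a pattern locally with grep -E before launching a long crawl (wget's default --regex-type is posix). The site name, the sample URLs, and the `.pdf` pattern below are placeholders, assuming you want to keep only PDF files:

```shell
# Hypothetical accept pattern: keep only URLs ending in .pdf
regex='.*\.pdf$'

# Check which sample URLs the pattern would accept or reject.
for url in \
  "http://www.mywebsite.com/docs/report.pdf" \
  "http://www.mywebsite.com/index.html"; do
  if echo "$url" | grep -Eq "$regex"; then
    echo "accept: $url"
  else
    echo "reject: $url"
  fi
done

# Once the pattern looks right, plug it into the original crawl:
# wget --wait=20 --limit-rate=20K -r -p -U Mozilla \
#      --accept-regex "$regex" "www.mywebsite.com"
```

Swap --accept-regex for --reject-regex to invert the filter, and add -L if the files you want are linked relatively.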


Upvotes: 2
