Brian

Reputation: 13571

wget `--reject-regex` not working?

Why is the following command able to download index.html from www.example.com?

wget --reject-regex .* http://www.example.com/

$ wget --reject-regex .* http://www.example.com/
--2018-03-05 11:21:26--  http://.keystone_install_lock/
Resolving .keystone_install_lock... failed: nodename nor servname provided, or not known.
wget: unable to resolve host address ‘.keystone_install_lock’
--2018-03-05 11:21:26--  http://www.example.com/
Resolving www.example.com... 93.184.216.34
Connecting to www.example.com|93.184.216.34|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1270 (1.2K) [text/html]
Saving to: ‘index.html’

index.html                                                    100%[=================================================================================================================================================>]   1.24K  --.-KB/s    in 0s

2018-03-05 11:21:27 (4.49 MB/s) - ‘index.html’ saved [1270/1270]

FINISHED --2018-03-05 11:21:27--
Total wall clock time: 0.4s
Downloaded: 1 files, 1.2K in 0s (4.49 MB/s)

The man page of wget says

--accept-regex urlregex

--reject-regex urlregex

Specify a regular expression to accept or reject the complete URL.

and the regular expression .* matches everything. (You may verify this using freeformatter.com)

I would expect wget to reject everything it tries to download because of the --reject-regex .* option.

.* matches www.example.com, doesn't it?

Why doesn't wget ignore everything in www.example.com?

Upvotes: 4

Views: 3268

Answers (3)

Louis Strous

Reputation: 1074

Part of the answer is that the .* in your command was likely expanded by your shell into a list of matching file names in your current working directory, because it isn't enclosed in appropriate quotes. The .keystone_install_lock in the output that you got is likely the name of a file in your current working directory. wget reports it before it even tries to connect to www.example.com. Try

wget --reject-regex '.*' http://www.example.com/

or perhaps with "" instead of '', depending on which shell you're using.
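The expansion described above can be reproduced without wget. The following is a minimal sketch, assuming a POSIX-style shell; the temporary directory and the file created in it are hypothetical stand-ins for the asker's working directory. It shows that an unquoted .* is replaced by matching file names before the command runs, while a quoted '.*' is passed through unchanged.

```shell
# Work in a throwaway directory so the glob has a known dotfile to match.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch .keystone_install_lock   # hypothetical stand-in for the file in the question

# Unquoted: the shell expands .* into file names before the program ever runs.
unquoted=$(printf '%s\n' .*)

# Quoted: the pattern reaches the program literally.
quoted=$(printf '%s\n' '.*')

printf 'unquoted:\n%s\n' "$unquoted"
printf 'quoted: %s\n' "$quoted"
```

Depending on the shell and its settings, the unquoted list may also include . and .., which is why the exact arguments wget receives can differ between systems.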

With that command I still get index.html retrieved, so my answer isn't complete.

With -np as suggested by Quantum7 I still get index.html, so that doesn't complete the answer, either.

Upvotes: 1

Quantum7

Reputation: 3315

Use the -np option to reject the index file. --reject-regex only applies to the recursive files (any links from the index file).

   -np
   --no-parent
       Do not ever ascend to the parent directory when retrieving recursively.
       This is a useful option, since it guarantees that only the
       files below a certain hierarchy will be downloaded.

Upvotes: 0

builder-7000

Reputation: 7627

--reject-regex only rejects URLs that wget would otherwise follow, not markup text in index.html. For example, if the website links to a CSS file main.css, then this command will recursively download the website but exclude main.css:

wget -r --reject-regex 'main.css' www.somewebsite.com
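As a rough model of that behaviour (an illustration only, not wget's actual implementation), the sketch below filters a hypothetical list of discovered links with grep -E while leaving the start URL alone. The link list and variable names are made up for the example, and it assumes, as the answers here describe, that the filter is applied to links followed during recursion rather than to the URL given on the command line.

```shell
# Hypothetical start URL and links found on it during a recursive crawl.
start_url='http://www.somewebsite.com/'
links='http://www.somewebsite.com/main.css
http://www.somewebsite.com/about.html
http://www.somewebsite.com/img/logo.png'

reject_regex='main.css'

# Only the discovered links are filtered; -v drops every URL the regex matches.
kept=$(printf '%s\n' "$links" | grep -Ev -- "$reject_regex")

# The start URL is listed unconditionally, followed by the surviving links.
printf 'fetch: %s\n' "$start_url"
printf '%s\n' "$kept" | sed 's/^/fetch: /'
```

In this model, a reject pattern of '.*' would filter out every discovered link while the start URL is still fetched, which is consistent with index.html being saved in the question.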

To ignore some text from the website use sed. A couple of examples:

# Strips the word 'Sans' from the page text
wget -qO- example.com | sed "s/Sans//g" > index.html

# Empties every line (the line breaks themselves remain)
wget -qO- example.com | sed "s/.*//g" > index.html

Upvotes: 0
