Lindsay Lohan

Reputation: 81

How to prevent file name "index.html?replytocom=xxx" in wget

I'm trying to keep wget from saving lots of strangely named files, such as index.html?replytocom=653, index.html?replytocom=667, etc.

I'm using the command below:

wget -k -m -r -q -R gif,png,jpg,jpeg,GIF,PNG,JPG,JPEG,?,= -t 1 http://www.website.com/

I also tried:

wget -k -m -r -q -R gif,png,jpg,jpeg,GIF,PNG,JPG,JPEG,?,=,replytocom -t 1 http://www.website.com/

but neither works; the files are still downloaded.

Upvotes: 2

Views: 1526

Answers (1)

kenorb

Reputation: 166429

In this case the reject list (-R rejlist) cannot help, because the wget documentation says:

Note, too, that query strings (strings at the end of a URL beginning with a question mark (`?`) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching against query strings.

Therefore you need to use the --reject-regex option instead, which is matched against the complete URL, query string included:

wget --reject-regex '(.*)\?(.*)' http://example.com

Note that --reject-regex seems to be usable only once per wget call, so to reject several patterns you have to join them with | in a single regex:

wget --reject-regex 'expr1|expr2|…' http://example.com

So, to answer your question, I'm guessing the solution would be something like:

wget --reject-regex '(.*)replytocom(.*)' (...)
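Before starting a long mirror, it can help to check a pattern against a few URLs offline. The following is only a rough sketch: the sample URLs are made up for illustration, the narrower pattern '\?replytocom=' is my own suggestion rather than anything from the wget manual, and grep -E is used as a stand-in for wget's default POSIX regex matching, which I assume behaves the same way for a simple pattern like this.

```shell
# Hypothetical sample URLs; only the replytocom query string comes from the question.
urls='http://www.website.com/
http://www.website.com/post/index.html
http://www.website.com/post/index.html?replytocom=653
http://www.website.com/post/?replytocom=667'

# The expression you would pass to --reject-regex (assumed, slightly
# narrower than '(.*)replytocom(.*)': it requires the literal "?").
pattern='\?replytocom='

# URLs that match the pattern, i.e. the ones wget should skip.
rejected=$(printf '%s\n' "$urls" | grep -E "$pattern")

# URLs that do not match, i.e. the ones wget should still fetch.
kept=$(printf '%s\n' "$urls" | grep -Ev "$pattern")

printf 'rejected:\n%s\n\nkept:\n%s\n' "$rejected" "$kept"
```

If the two lists look right, the same pattern can be added to the command from the question, for example by appending --reject-regex '\?replytocom=' to the existing options and dropping the ?, = and replytocom entries from -R, since those have no effect on query strings.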

Upvotes: 2
