cootje

Reputation: 460

Using wget but ignoring URL parameters

I want to download the contents of a website where the URLs are built as

http://www.example.com/level1/level2?option1=1&option2=2

Within the URL, only the http://www.example.com/level1/level2 part is unique for each page; the values of option1 and option2 change. In fact, every unique page can appear under hundreds of different URLs because of these variables. I am using wget to fetch all of the site's content, and because of this problem I have already downloaded more than 3 GB of data. Is there a way to tell wget to ignore everything after the URL's question mark? I can't find it in the man pages.

Upvotes: 27

Views: 12861

Answers (5)

Jan Joneš

Reputation: 938

wget2 has this built in via options --cut-url-get-vars and --cut-file-get-vars.
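
As a minimal sketch (assuming wget2 is installed and http://www.example.com/ is the target site), a recursive download that strips query strings from both the requested URLs and the saved file names might look like:

wget2 --recursive --cut-url-get-vars --cut-file-get-vars http://www.example.com/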

Upvotes: 5

Peter Thoeny

Reputation: 7616

@kenorb's answer using --reject-regex is good. However, it did not work in my case on an older version of wget. Here is the equivalent using wildcards, which works with GNU Wget 1.12:

wget --reject "*\?*" -m -c --content-disposition http://example.com/

Upvotes: 0

kojow7

Reputation: 11424

This does not help in your case, but for those who have already downloaded all of these files: you can quickly rename them to remove the question mark and everything after it as follows:

rename -v -n 's/[?].*//' *[?]*

The above command performs a dry run and shows how the files would be renamed. If everything looks good, run the command again without the -n (nono) switch.
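
To illustrate with a hypothetical file name, the substitution strips everything from the question mark onward, so the dry run would propose renames of the form:

index.html?option1=1&option2=2  ->  index.html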

Upvotes: 1

kenorb

Reputation: 166843

You can use --reject-regex to specify a regular expression for URLs to reject, e.g.

wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/

This will mirror the website but ignore addresses containing a question mark, which is useful for mirroring wiki sites.

Upvotes: 38

cootje

Reputation: 460

Problem solved. I noticed that the URLs I want to download are all search-engine friendly, with descriptions formed using dashes:

http://www.example.com/main-topic/whatever-content-in-this-page

All other URLs had references to the CMS. I got everything I needed with:

wget -r http://www.example.com -A "*-*"

This did the trick. Thanks for sharing your thoughts!

Upvotes: 0
