Reputation: 460
I want to download the contents of a website where the URLs are built as
http://www.example.com/level1/level2?option1=1&option2=2
Within the URL, only http://www.example.com/level1/level2 is unique for each page, while the values of option1 and option2 change. In fact, every unique page can appear under hundreds of different URLs because of these variables. I am using wget to fetch all of the site's content, and because of this problem I have already downloaded more than 3GB of data. Is there a way to tell wget to ignore everything after the URL's question mark? I can't find such an option in the man pages.
Upvotes: 27
Views: 12861
Reputation: 938
wget2 has this built in via the options --cut-url-get-vars and --cut-file-get-vars.
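For example, a minimal sketch combining them with a recursive mirror (the host is the question's placeholder):
wget2 -r --cut-url-get-vars --cut-file-get-vars http://www.example.com/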
Upvotes: 5
Reputation: 7616
@kenorb's answer using --reject-regex is good, but it did not work in my case on an older version of wget. Here is the equivalent using wildcards, which works with GNU Wget 1.12:
wget --reject "*\?*" -m -c --content-disposition http://example.com/
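Note that in recursive mode wget still downloads rejected HTML pages in order to scan them for further links, deleting them afterwards, so this saves disk space but not necessarily bandwidth.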
Upvotes: 0
Reputation: 11424
This does not help in your case, but those who have already downloaded all of these files can quickly rename them to remove the question mark and everything after it:
rename -v -n 's/[?].*//' *[?]*
The above command does a trial run and shows how the files would be renamed. If everything looks good, run the command again without the -n (--nono) switch.
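If the Perl rename utility is not available, a rough shell-loop equivalent is the sketch below (it lacks a trial-run mode, but mv -i prompts before overwriting an existing file):
for f in *\?*; do mv -i -- "$f" "${f%%\?*}"; done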
Upvotes: 1
Reputation: 166843
You can use --reject-regex to specify a pattern for rejecting specific URLs, e.g.
wget --reject-regex "(.*)\?(.*)" -m -c --content-disposition http://example.com/
This will mirror the website while ignoring addresses that contain a question mark, which is useful for mirroring wiki sites.
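Since the regex match is unanchored, the surrounding (.*) groups should not be strictly necessary; a plain \? ought to behave the same:
wget --reject-regex "\?" -m -c --content-disposition http://example.com/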
Upvotes: 38
Reputation: 460
Problem solved. I noticed that the URLs I want to download are all search-engine friendly, with descriptions formed using dashes:
http://www.example.com/main-topic/whatever-content-in-this-page
All other URLs had references to the CMS. I got everything I needed with:
wget -r http://www.example.com -A "*-*"
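(Note: the -A pattern is matched against the file-name part of each URL, so "*-*" accepts only pages whose names contain a dash.)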
This did the trick. Thanks for sharing your thoughts!
Upvotes: 0