Reputation: 21
Here is my problem.
I am working on a script that automates downloading some software I use to "clean" my computer.
I have been able to download files with direct URLs like "https://www.driverscloud.com/plugins/DriversCloud_Win.exe", but not with URLs that redirect to a download URL after a short wait, like "https://www.ccleaner.com/fr-fr/ccleaner/download/standard".
I can see that the problem is that I am not giving Wget a direct download address, but I would like to make it work with the address "https://www.ccleaner.com/fr-fr/ccleaner/download/standard", because Piriform (the developer of CCleaner) updates the software quite regularly and the download address changes with the version number (for example: https://download.ccleaner.com/ccsetup547.exe -> https://download.ccleaner.com/ccsetup548.exe).
So how can I ask Wget to follow the download link contained in the page rather than download the page itself? (At the moment I just get a file called "standard", like the end of the URL "https://www.ccleaner.com/fr-fr/ccleaner/download/standard".)
I would be delighted if you have a solution for me, with Wget or another tool such as Curl :) .
Thank you in advance.
Upvotes: 2
Views: 2803
Reputation: 5180
You don't need PHP for that. wget alone is powerful enough to do this simple job :)
Here's the command you need (I'll give a breakdown below):
$ wget -r -l 1 --span-hosts --accept-regex='.*download\.ccleaner\.com/.*\.exe' -erobots=off -nH https://www.ccleaner.com/fr-fr/ccleaner/download/standard
Now, for a breakdown of what this does:
-r: Enables recursion, since we want to follow a link on the provided page.
-l 1: Recurse only one level deep, since the required URL is on the same page.
--span-hosts: The required file is on a different host than the original URL we provide, so we ask wget to go across hosts when recursing.
--accept-regex=...: A regular expression that links must match to be followed during recursion. Since we only want one file and know its URL pattern, we make the regex pretty specific.
-erobots=off: The download.ccleaner.com host has a robots.txt that forbids all user-agents. But we're not crawling the domain, so disable honoring the robots file.
-nH: Don't create host-specific directories. This means the exe will be downloaded directly into your current folder.
If you want a little more automation, you can also append a && rm -r fr-fr/ to the above command to remove the base page that you downloaded in order to get the right link.
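If you want to sanity-check the accept pattern before running the full command, you can test it offline against sample link URLs, using grep -E as a rough stand-in for wget's POSIX regex matching (the URLs below are just examples):

```shell
# Pattern from --accept-regex, with the dots escaped so "." is literal
regex='.*download\.ccleaner\.com/.*\.exe'

# A versioned installer URL like the ones the page links to should match
echo 'https://download.ccleaner.com/ccsetup548.exe' | grep -Eq "$regex" && echo 'match'

# An unrelated page on the main host should not
echo 'https://www.ccleaner.com/fr-fr/ccleaner/download/standard' | grep -Eq "$regex" || echo 'no match'
```

This prints "match" then "no match", confirming only the installer links would be followed.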
Enjoy!
EDIT: Since OP is on Windows, here is an updated command specifically for running there. It doesn't single-quote the regex, since the Windows shell would pass the single quotes through to wget as part of the pattern.
$ wget -r -l 1 --span-hosts --accept-regex=.*download\.ccleaner\.com/.*\.exe -erobots=off -nH https://www.ccleaner.com/fr-fr/ccleaner/download/standard
Upvotes: 2
Reputation: 21513
wget's spider mode might be able to do it, but this isn't really a job for either curl or wget. You need to fetch the download page and then extract the download URL for the newest version from that HTML. Some sites also set a cookie on the download page and require you to submit that cookie to download the actual file. This is a job for a language that understands HTTP and HTML; PHP is one such language. Taking CCleaner's download page as an example:
#!/usr/bin/env php
<?php
// Fetch the download page
$ch = curl_init("https://www.ccleaner.com/fr-fr/ccleaner/download/standard");
curl_setopt_array($ch, array(
    CURLOPT_COOKIEFILE => '',    // enable the cookie engine (in-memory cookie jar)
    CURLOPT_ENCODING => '',      // accept any compression curl supports
    CURLOPT_RETURNTRANSFER => 1, // return the body instead of printing it
    CURLOPT_SSL_VERIFYPEER => 0
));
$html = curl_exec($ch);
// Parse the HTML and locate the download link by its anchor text
$domd = @DOMDocument::loadHTML($html);
$xp = new DOMXPath($domd);
$download_element = $xp->query('//a[contains(text(),"start the download")]')->item(0);
$download_url = $download_element->getAttribute("href");
$download_name = basename($download_url); // fetching it from the headers of the download would be more reliable but cba
echo "download name: \"{$download_name}\" - url: {$download_url}\n";
// Re-use the same handle (and its cookies) to download the installer itself
curl_setopt($ch, CURLOPT_URL, $download_url);
$installer_binary = curl_exec($ch);
file_put_contents($download_name, $installer_binary);
This script fetches the download page, then extracts the "href" (URL) attribute of the <a href="download_url">start the download</a>
element (the one containing the text "start the download"), then downloads whatever that URL points to. This is beyond the scope of wget/curl; use a scripting language.
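The more reliable header-based filename idea mentioned in the code comment can be sketched like this. This works offline against a hypothetical Content-Disposition header of the kind a download server might send; the header value here is made up for illustration:

```shell
# Hypothetical response header from the installer download (example value)
header='Content-Disposition: attachment; filename="ccsetup548.exe"'

# Pull out the quoted filename; prints nothing if the header has no filename part
name=$(printf '%s' "$header" | sed -n 's/.*filename="\([^"]*\)".*/\1/p')
echo "$name"
```

With a real download you would fetch the headers first (for example with curl -sI "$download_url") and apply the same extraction before saving the file.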
Upvotes: 0