Spekuloos

Reputation: 21

Capture a download link redirected by a page (WGET)

Here is my problem.

I am currently working on a personal script that automates downloading some of the software I use to "clean" my computer.

I have been able to make downloads work with direct download URLs like this one: "https://www.driverscloud.com/plugins/DriversCloud_Win.exe", but not with URLs that redirect to a download URL after a short wait, like this one: "https://www.ccleaner.com/fr-fr/ccleaner/download/standard".

I can see that the problem is that I don't give Wget a direct download address. But I would like to be able to use the address "https://www.ccleaner.com/fr-fr/ccleaner/download/standard", because Piriform (the developer of CCleaner) updates the software quite regularly and the download address changes with the version number (for example: https://download.ccleaner.com/ccsetup547.exe -> https://download.ccleaner.com/ccsetup548.exe).

So how can I ask Wget to follow the download link contained in the page rather than download the page itself? (At the moment I get a file called "standard", matching the end of the URL "https://www.ccleaner.com/fr-fr/ccleaner/download/standard".)

I would be delighted if you have a solution for me, whether with Wget or another tool such as Curl :) .

Thank you in advance.

Upvotes: 2

Views: 2803

Answers (2)

darnir

Reputation: 5180

You don't need PHP for that. wget alone is powerful enough to do this simple job :)

Here's the command you need (I'll give a breakdown below):

$ wget -r -l 1 --span-hosts --accept-regex='.*download.ccleaner.com/.*.exe' -erobots=off -nH https://www.ccleaner.com/fr-fr/ccleaner/download/standard

Now, for a breakdown of what this does:

  • -r: Enables recursion since we want to follow a link on the provided page
  • -l 1: We want to recurse only one level deep since the required URL is on the same page
  • --span-hosts: The required file is on a different host than the original URL we provide. So we ask wget to go across hosts when using recursion
  • --accept-regex=...: This specifies a regular expression for the links that will be followed during recursion. Since we only want one file and know its URL pattern, we write a fairly specific regex.
  • -erobots=off: The download.ccleaner.com host has a robots.txt that forbids all user-agents. But we're not crawling the domain, so disable honoring the robots file
  • -nH: Don't create host-specific directories. This means the .exe will be downloaded directly into your current folder.
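The effect of --accept-regex can be checked offline before letting wget loose on the site. The sketch below runs the same pattern (with the dots escaped; the unescaped version in the command above also matches, since an unescaped . matches any character) against some hypothetical sample links via grep -E, which uses a close cousin of the POSIX extended regex syntax wget uses:

```shell
#!/bin/sh
# Hypothetical links that might appear on the download page.
urls='https://download.ccleaner.com/ccsetup548.exe
https://www.ccleaner.com/fr-fr/support
https://download.ccleaner.com/ccsetup548.dmg'

# Only the .exe hosted on download.ccleaner.com survives the filter,
# which is exactly the link wget's recursion would be allowed to follow.
echo "$urls" | grep -E '.*download\.ccleaner\.com/.*\.exe'
```

Only the first sample URL is printed; the support page and the .dmg are rejected, mirroring what wget's recursion filter would do.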

If you want a little more automation, you can also append a && rm -r fr-fr/ to the above command to remove the base page that you downloaded in order to get the right link.

Enjoy!

EDIT: Since OP is on Windows, here is an updated command specifically for running on Windows. It doesn't single-quote the regex string, because the Windows shell would pass the quotes through to wget as part of the regex.

$ wget -r -l 1 --span-hosts --accept-regex=.*download.ccleaner.com/.*.exe -erobots=off -nH https://www.ccleaner.com/fr-fr/ccleaner/download/standard

Upvotes: 2

hanshenrik

Reputation: 21513

wget's spider mode might be able to do it, but this isn't really a job for either curl or wget. You need to fetch the download page and then extract the download URL for the newest version from its HTML. Some sites also set a cookie on the download page and require you to send that cookie back to download the actual file. This is a job for a language that understands HTTP and HTML, and PHP is one such language. Taking CCleaner's download page as an example:

#!/usr/bin/env php
<?php
// Fetch the download page. An empty CURLOPT_COOKIEFILE enables curl's
// in-memory cookie engine, so any cookies set here are sent back later.
$ch = curl_init("https://www.ccleaner.com/fr-fr/ccleaner/download/standard");
curl_setopt_array($ch, array(
    CURLOPT_COOKIEFILE => '',
    CURLOPT_ENCODING => '',      // accept any compression curl supports
    CURLOPT_RETURNTRANSFER => 1,
    CURLOPT_SSL_VERIFYPEER => 0  // insecure, but fine for a quick script
));
$html = curl_exec($ch);
// Parse the HTML; @ silences warnings about malformed markup.
$domd = new DOMDocument();
@$domd->loadHTML($html);
$xp = new DOMXPath($domd);
// Find the <a> element whose text contains "start the download".
$download_element = $xp->query('//a[contains(text(),"start the download")]')->item(0);
$download_url = $download_element->getAttribute("href");
$download_name = basename($download_url); // fetching it from the headers of the download would be more reliable but cba
echo "download name: \"{$download_name}\" - url: {$download_url}\n";
// Re-use the same handle (and its cookies) to fetch the installer itself.
curl_setopt($ch, CURLOPT_URL, $download_url);
$installer_binary = curl_exec($ch);
file_put_contents($download_name, $installer_binary);

This script fetches the download page, extracts the href attribute (the URL) from the <a href="download_url">start the download</a> element containing the text "start the download", then downloads whatever that URL points to. This is beyond the scope of wget/curl alone; use a scripting language.
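For what it's worth, the extraction step itself can be approximated in plain shell when the markup is simple enough. This is only a sketch against a hypothetical snippet of the page (the real markup may differ, and a proper HTML parser as in the PHP script above is more robust):

```shell
#!/bin/sh
# Hypothetical fragment of the download page.
html='<p>Your download will begin shortly.</p>
<a href="https://download.ccleaner.com/ccsetup548.exe">start the download</a>'

# Keep the line containing the link text, then cut out the href value.
url=$(printf '%s\n' "$html" \
  | grep 'start the download' \
  | sed -E 's/.*href="([^"]+)".*/\1/')
echo "$url"
# The actual fetch would then be something like:
#   curl -L -c cookies.txt -b cookies.txt -O "$url"
```

This breaks as soon as the anchor spans several lines or the attribute order changes, which is exactly why a language that understands HTML is the better tool here.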


Upvotes: 0
