CiaranWelsh

Reputation: 7681

How to download compressed files using the curl C API?

I want to download a compressed file from a URL using libcurl C API. I have the following code:

// CurlGet.h

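#include <cstddef>
#include <cstdio>     // FILE, fopen, fwrite, fclose
#include <cstdlib>
#include <cstring>
#include <stdexcept>  // std::logic_error, std::invalid_argument
#include <curl/curl.h>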


struct memory {
    char *response;
    size_t size;
};

size_t callBackWrite(void *data, size_t size, size_t nmemb, void *userp) {
    size_t written = fwrite(data, size, nmemb, (FILE *) userp);
    return written;
}

int curlGetC(const char *url, const char* output_filename) {
    CURL *curl_handle;

    curl_global_init(CURL_GLOBAL_ALL);

    /* init the curl session */
    curl_handle = curl_easy_init();
    if (!curl_handle) {
        throw std::logic_error("You no curl");
    }

    /* set URL to get here */
    curl_easy_setopt(curl_handle, CURLOPT_URL, url);

    /* Switch on full protocol/debug output while testing */
    curl_easy_setopt(curl_handle, CURLOPT_VERBOSE, 1L);

    /* enable the progress meter, set to 1L to disable it */
    curl_easy_setopt(curl_handle, CURLOPT_NOPROGRESS, 0L);

    /* send all data to this function  */
    curl_easy_setopt(curl_handle, CURLOPT_WRITEFUNCTION, callBackWrite);

    /* open the file */
    FILE *f = fopen(output_filename, "wb");
    if (!f) {
        throw std::invalid_argument("You no got file");
    }

    /* write the page body to this file handle */
    curl_easy_setopt(curl_handle, CURLOPT_WRITEDATA, f);

    /* get it! */
    curl_easy_perform(curl_handle);

    /* close the output file */
    fclose(f);

    /* cleanup curl stuff */
    curl_easy_cleanup(curl_handle);

    curl_global_cleanup();
    return 0;
}

Then using this code to download a web page works as expected but downloading an omex file (which is actually just a zip file with the omex extension name) does not:


#include "CurlGet.h"
#include <iostream>
#include <string>

// works as expected
std::string url1 = "https://isocpp.org/wiki/faq/mixing-c-and-cpp";
std::string output_filename1 = "/mnt/d/libsemsim/semsim/example.html";
curlGetC(url1.c_str(), output_filename1.c_str());

// downloaded file is 0 bytes.
std::string url2 = "https://auckland.figshare.com/ndownloader/files/17432333";
std::string output_filename2 = "/mnt/d/libsemsim/semsim/example.omex";
curlGetC(url2.c_str(), output_filename2.c_str());

Could anybody suggest how to modify my code to get it to download the compressed file?

Edit: here is the verbose trace:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 52.48.88.255...
* TCP_NODELAY set
* Connected to auckland.figshare.com (52.48.88.255) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=GB; L=London; O=figshare LLP; CN=*.figshare.com
*  start date: Mar 20 00:00:00 2019 GMT
*  expire date: Jul  9 12:00:00 2020 GMT
*  subjectAltName: host "auckland.figshare.com" matched cert's "*.figshare.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*  SSL certificate verify ok.
> GET /ndownloader/files/17432333 HTTP/1.1
Host: auckland.figshare.com
Accept: */*

< HTTP/1.1 302 Found
< Date: Sun, 12 Apr 2020 10:43:10 GMT
< Content-Type: application/octet-stream
< Content-Length: 0
< Connection: keep-alive
< Server: nginx
< X-Storage-Protocol: https
< X-Filename: BIOMD0000000204_new.omex
< Location: https://objectext.auckland.ac.nz/figshare/17432333/BIOMD0000000204_new.omex
< X-Storage-Host: objectext.auckland.ac.nz
< X-Storage-File: 17432333/BIOMD0000000204_new.omex
< X-Storage-Bucket: figshare
< Content-Disposition: attachment;filename=BIOMD0000000204_new.omex
< Cache-Control: no-cache, no-store
< Set-Cookie: fig_tracker_client=0975a192-4ec5-4a63-a800-c598eb7ca6b5; Max-Age=31536000; Path=/; expires=Mon, 12-Apr-2021 10:43:10 GMT; secure; HttpOnly
< X-Robots-Tag: noindex
< X-Frame-Options: SAMEORIGIN
< X-XSS-Protection: 1; mode=block
< Strict-Transport-Security: max-age=31536000; includeSubDomains;
< Cache-Control: public, must-revalidate, proxy-revalidate
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Methods: GET, OPTIONS
< Access-Control-Allow-Headers: Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type,Authorization,Range
< Access-Control-Expose-Headers: Location,Accept-Ranges,Content-Encoding,Content-Length,Content-Range
< 
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
* Connection #0 to host auckland.figshare.com left intact

Upvotes: 0

Views: 633

Answers (2)

rici

Reputation: 241721

This really has nothing to do with the fact that the target file is compressed. Zip files are archives whose components are compressed individually; it is not possible to decompress a zip file into a single meaningful object. That's different from gzipped tar archives, for example. (However, it is not generally desirable for a user agent to automatically decompress a .tgz file into a .tar file, even though it could.)

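Your problem stems from the fact that the URL you requested is not the final location of the file. The web server responded with a redirect (302) status code, which tells the user agent to make a new request for the resource using the URI provided in the Location response header. The redirect response itself has no body (note the Content-Length: 0 in your trace), which is why your output file is 0 bytes.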

You need to tell libcurl to follow redirects.

curl_easy_setopt(curl_handle, CURLOPT_FOLLOWLOCATION, 1L);

302 redirects differ from 301 redirects in that the redirection is marked as temporary. The 301 return code suggests to the user agent that it should remember the redirection and not attempt to use the original URL in the future. A 302 response should not be cached; it might, for example, be used to provide the location of what is currently the most recent version of a resource.

Upvotes: 1

hanshenrik

Reputation: 21483

Here is (probably) what happened:

You sent a request without an Accept-Encoding header, and the server (foolishly, IMO) assumed that since you didn't name any specific encodings, you probably support gzip. (That sounds stupid, I know, but the proper way to say "I don't support any content encodings" is to send the header Accept-Encoding: identity, and you didn't do that.) The server then decided to answer with Content-Encoding: gzip, which your code ignored, so gzip-compressed data was saved to your output_filename.

To tell curl to deal with encodings automatically (the easiest solution the vast majority of the time), set CURLOPT_ACCEPT_ENCODING to the empty string. This tells curl to request a compressed transfer and to decompress the response automatically before writing it:

curl_easy_setopt(curl_handle, CURLOPT_ACCEPT_ENCODING, "");

That should fix your problem. curl will now send a header like Accept-Encoding: gzip, deflate, br (the exact compression algorithms listed depend on what your libcurl was compiled to support), and the server will choose one of those encodings. If the server doesn't support any of the encodings your libcurl supports, it should send the data uncompressed. curl in turn will decompress the data automatically before passing it to CURLOPT_WRITEFUNCTION.

You can find the relevant documentation here: https://curl.haxx.se/libcurl/c/CURLOPT_ACCEPT_ENCODING.html

Upvotes: 0
