Mojtaba

Reputation: 316

How to obtain the URL of a redirected webpage in C++

I wrote C++ code that automatically parses a webpage, then opens and parses some of the links on it. Some of the addresses on that page redirect to other pages. For example, when I try to open:

https://atlas.immobilienscout24.de/property-by-address?districtId=1276001006014

I ended up opening:

https://atlas.immobilienscout24.de/orte/deutschland/baden-württemberg/böblingen-kreis/leonberg

How can I get the URL of the second page in C++?

Upvotes: 0

Views: 86

Answers (2)

hanshenrik

Reputation: 21483

You could use CURLOPT_HEADERFUNCTION to inspect the response headers and parse out the Location header, e.g.:

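#include <iostream>
#include <string>
#include <curl/curl.h>

// Called by libcurl once per response header line. The line is not
// null-terminated and normally ends with "\r\n".
size_t header_callback(char *buffer, size_t size, size_t nitems, void *userdata)
{
  const size_t len = size * nitems; // size is always 1
  const std::string needle = "Location:";
  // Header names are case-insensitive (and lowercase over HTTP/2).
  if (len > needle.size() && curl_strnequal(buffer, needle.c_str(), needle.size())) {
    const std::string value(buffer + needle.size(), len - needle.size());
    // Trim the whitespace after the colon and the trailing CRLF.
    const size_t first = value.find_first_not_of(" \t");
    const size_t last = value.find_last_not_of(" \t\r\n");
    if (last != std::string::npos) {
      // Note: the value may be a relative URL.
      static_cast<std::string *>(userdata)->assign(value, first, last - first + 1);
    }
  }
  return len; // returning anything else aborts the transfer
}

int main()
{
  CURL *hnd = curl_easy_init();
  if (!hnd) return 1;
  std::string redirect_url;
  curl_easy_setopt(hnd, CURLOPT_URL, "https://atlas.immobilienscout24.de/property-by-address?districtId=1276001006014");
  curl_easy_setopt(hnd, CURLOPT_NOPROGRESS, 1L);
  curl_easy_setopt(hnd, CURLOPT_NOBODY, 1L); // HEAD request: headers only
  curl_easy_setopt(hnd, CURLOPT_HEADERDATA, &redirect_url);
  curl_easy_setopt(hnd, CURLOPT_HEADERFUNCTION, header_callback);
  const CURLcode ret = curl_easy_perform(hnd);
  curl_easy_cleanup(hnd);
  if (ret != CURLE_OK) {
    std::cerr << curl_easy_strerror(ret) << '\n';
    return 1;
  }
  std::cout << redirect_url << '\n';
  return 0;
}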
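
One caveat about the header-callback approach: each header line that libcurl passes to the callback still ends in "\r\n", and header names are case-insensitive (HTTP/2 servers send them in lowercase). Below is a minimal sketch of a parsing helper that handles both. The name parse_location is hypothetical (it is not a libcurl function), and C++17 is assumed.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper (not part of libcurl): returns the value of a
// "Location" header line, or std::nullopt if the line is some other
// header or the value is empty.
std::optional<std::string> parse_location(std::string_view line)
{
    constexpr std::string_view name = "location:";
    if (line.size() < name.size()) return std::nullopt;
    // Compare the header name case-insensitively.
    for (std::size_t i = 0; i < name.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(line[i])) != name[i]) {
            return std::nullopt;
        }
    }
    line.remove_prefix(name.size());
    // Drop the whitespace after the colon and the trailing CRLF.
    const std::size_t last = line.find_last_not_of(" \t\r\n");
    if (last == std::string_view::npos) return std::nullopt;
    const std::size_t first = line.find_first_not_of(" \t");
    return std::string(line.substr(first, last - first + 1));
}
```

Inside a header callback this could be called as parse_location(std::string_view(buffer, size * nitems)). The value it returns may still be a relative URL, since servers are allowed to send one.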

But if you want the final URL (in case of multiple redirects) rather than just "the second URL", you should probably use CURLOPT_FOLLOWLOCATION and CURLINFO_EFFECTIVE_URL instead, e.g.:

#include <iostream>
#include <string>
#include <curl/curl.h>

int main()
{
  CURL *hnd = curl_easy_init();
  if (!hnd) return 1;
  curl_easy_setopt(hnd, CURLOPT_URL, "https://atlas.immobilienscout24.de/property-by-address?districtId=1276001006014");
  curl_easy_setopt(hnd, CURLOPT_NOPROGRESS, 1L);
  curl_easy_setopt(hnd, CURLOPT_NOBODY, 1L);         // HEAD request: headers only
  curl_easy_setopt(hnd, CURLOPT_FOLLOWLOCATION, 1L); // follow redirects
  const CURLcode ret = curl_easy_perform(hnd);
  std::string final_url;
  char *effective = NULL; // owned by libcurl, freed by curl_easy_cleanup
  if (ret == CURLE_OK &&
      curl_easy_getinfo(hnd, CURLINFO_EFFECTIVE_URL, &effective) == CURLE_OK &&
      effective) {
    final_url = effective; // copy before cleanup
  }
  curl_easy_cleanup(hnd);
  if (ret != CURLE_OK) {
    std::cerr << curl_easy_strerror(ret) << '\n';
    return 1;
  }
  std::cout << final_url << '\n';
  return 0;
}

This approach is slower (it has to make at least one more request for each redirect), but it is much simpler to implement and works alike for URLs that are not redirected, redirected once, or redirected several times.

Upvotes: 1

Lightness Races in Orbit

Reputation: 385144

In that particular case, it's given by the Location header in a 301 ("Moved Permanently") response (according to Chrome's Developer Tools).

If you set CURLOPT_FOLLOWLOCATION to 0, libcurl will not follow redirects, and you can then just examine the headers of the original response (or, probably better, query CURLINFO_REDIRECT_URL for the information).

(Then you can perform a new request to the alternative URL, if you like.)

The default for this option is 0, though, so you must currently be setting it to 1 yourself.

Upvotes: 3
