Reputation: 316
I wrote some C++ code that automatically parses a webpage, then opens and parses some of its links. The problem is that some of the addresses on these pages redirect to other webpages. For example, when I try to open:
https://atlas.immobilienscout24.de/property-by-address?districtId=1276001006014
I ended up opening:
https://atlas.immobilienscout24.de/orte/deutschland/baden-württemberg/böblingen-kreis/leonberg
How could I get the url of the second page in C++?
Upvotes: 0
Views: 86
Reputation: 21483
You could use CURLOPT_HEADERFUNCTION to inspect the response headers and parse out the Location header, e.g.:
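#include <cstddef>
#include <iostream>
#include <string>
#include <curl/curl.h>

// libcurl calls this once for every response header line.
size_t header_callback(char *buffer, size_t size, size_t nitems, void *userdata)
{
    const size_t total = size * nitems;
    const std::string needle = "Location:";
    // Header names are case-insensitive (HTTP/2 sends them in lowercase),
    // so compare with libcurl's case-insensitive curl_strnequal().
    if (total > needle.size() && curl_strnequal(buffer, needle.c_str(), needle.size())) {
        std::string value(buffer + needle.size(), total - needle.size());
        // Trim the space after the colon and the trailing "\r\n".
        const size_t first = value.find_first_not_of(" \t\r\n");
        const size_t last = value.find_last_not_of(" \t\r\n");
        if (first != std::string::npos) {
            static_cast<std::string *>(userdata)->assign(value, first, last - first + 1);
        }
    }
    return total; // any other value makes libcurl abort the transfer
}

int main()
{
    CURL *hnd = curl_easy_init();
    if (!hnd) {
        return 1;
    }
    curl_easy_setopt(hnd, CURLOPT_URL, "https://atlas.immobilienscout24.de/property-by-address?districtId=1276001006014");
    curl_easy_setopt(hnd, CURLOPT_NOPROGRESS, 1L);
    curl_easy_setopt(hnd, CURLOPT_NOBODY, 1L); // HEAD request: headers only
    std::string redirect_url;
    curl_easy_setopt(hnd, CURLOPT_HEADERDATA, &redirect_url);
    curl_easy_setopt(hnd, CURLOPT_HEADERFUNCTION, header_callback);
    CURLcode ret = curl_easy_perform(hnd);
    curl_easy_cleanup(hnd);
    std::cout << redirect_url << '\n';
    return static_cast<int>(ret);
}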
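The header-parsing step can also be pulled out into a small function that is easy to check without any network access. The following is only a sketch: `parse_location_header` is a hypothetical helper name, not part of libcurl, and it assumes each header line arrives as one complete string, which is how CURLOPT_HEADERFUNCTION delivers them. It matches the header name case-insensitively and strips the whitespace after the colon as well as the trailing "\r\n".

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not a libcurl function): if `line` is a Location
// header, store its trimmed value in `value` and return true.
bool parse_location_header(const std::string &line, std::string &value)
{
    const std::string name = "location:";
    if (line.size() < name.size()) {
        return false;
    }
    // Compare the header name case-insensitively.
    for (std::size_t i = 0; i < name.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(line[i])) != name[i]) {
            return false;
        }
    }
    // Skip whitespace after the colon; an empty value is still a match.
    const std::size_t first = line.find_first_not_of(" \t\r\n", name.size());
    if (first == std::string::npos) {
        value.clear();
        return true;
    }
    // Drop the trailing "\r\n" (and any trailing spaces).
    const std::size_t last = line.find_last_not_of(" \t\r\n");
    value = line.substr(first, last - first + 1);
    return true;
}
```

Inside a header callback, the buffer is not null-terminated, so the line would have to be built with an explicit length, for example `std::string(buffer, size * nitems)`. Note that the value is returned exactly as the server sent it, so it may be a relative URL.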
However, if you want the final URL (in case of multiple redirects) rather than just "the second URL", you should probably use CURLOPT_FOLLOWLOCATION and CURLINFO_EFFECTIVE_URL instead, e.g.:
#include <iostream>
#include <string>
#include <curl/curl.h>

int main()
{
    CURL *hnd = curl_easy_init();
    if (!hnd) {
        return 1;
    }
    curl_easy_setopt(hnd, CURLOPT_URL, "https://atlas.immobilienscout24.de/property-by-address?districtId=1276001006014");
    curl_easy_setopt(hnd, CURLOPT_NOPROGRESS, 1L);
    curl_easy_setopt(hnd, CURLOPT_NOBODY, 1L); // HEAD request: headers only
    curl_easy_setopt(hnd, CURLOPT_FOLLOWLOCATION, 1L); // follow every redirect
    CURLcode ret = curl_easy_perform(hnd);
    // The returned pointer belongs to the handle, so copy it before cleanup.
    char *effective = NULL;
    std::string final_url;
    if (curl_easy_getinfo(hnd, CURLINFO_EFFECTIVE_URL, &effective) == CURLE_OK && effective) {
        final_url = effective;
    }
    curl_easy_cleanup(hnd);
    std::cout << final_url << '\n';
    return static_cast<int>(ret);
}
This approach is slower (each redirect costs at least one more request), but it is much simpler to implement and works the same way for non-redirected, redirected, and multiply redirected URLs.
Upvotes: 1
Reputation: 385144
In that particular case, it's given by the Location header in a 301 ("Moved Permanently") response (according to Chrome's Developer Tools).
If you set CURLOPT_FOLLOWLOCATION to 0, you can prevent libcurl from following redirects, and then just examine the headers of the original response (or, probably better, query CURLINFO_REDIRECT_URL for the information).
(Then you can perform a new request to the alternative URL, if you like.)
The default for this is 0, though, so you must be setting it to 1 yourself currently.
Upvotes: 3