Reputation: 162
Suppose I have an R script and a long list of URLs (100k+); what is the most efficient method of capturing the final redirected URL?
For example: if you ping "www.someurl.com" and it redirects to "www.someurl.com/homepage", I'd like to record that into a data frame.
I tried using the HEAD function from httr but didn't seem to get what I wanted, i.e.:
library(httr)

getCanonicalURLs <- function(url) {
  canonicalURL <- HEAD(url)
}
urlRedirects <- lapply(as.character(urlList), getCanonicalURLs)
Upvotes: 1
Views: 1331
Reputation: 493
Here's a function which checks the HTTP status (= 200) and returns the redirected URLs. I wanted something that works with dplyr::mutate, and this function does that.
getCanonicalURLs <- function(urls) {
  op <- rep(NA, length(urls))
  for (i in seq_along(urls)) {
    # only ask for the final URL if the site responds OK; otherwise keep NA
    op[i] <- tryCatch(
      if (crul::ok(urls[i], info = FALSE)) httr::HEAD(urls[i])$url else NA,
      error   = function(e) NA,
      warning = function(w) NA
    )
  }
  return(op)
}
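Since the function returns a vector the same length as its input, it drops straight into mutate(). A minimal usage sketch (the data frame df and its url column are just illustrative names, not part of the question):

library(dplyr)

df <- data.frame(url = c("http://www.someurl.com", "http://www.ard.de"),
                 stringsAsFactors = FALSE)

# adds a column with the final redirected URL (or NA on failure)
df <- df %>%
  mutate(canonical_url = getCanonicalURLs(url))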
Upvotes: 0
Reputation: 23909
I think you can go with base::curlGetHeaders():
curlGetHeaders("www.ard.de")
[1] "HTTP/1.1 301 Moved Permanently\r\n"
[2] "Server: Apache\r\n"
[3] "Location: http://www.ard.de/home/ard/ARD_Startseite/21920/index.html\r\n"
[4] "Content-Length: 328\r\n"
...
Then just get the element that starts with "Location".
stringr::str_extract(grep(curlGetHeaders("www.ard.de"), pattern = "Location", value = T), pattern = "http://.*")
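To run this over the whole list from the question, the same idea could be wrapped in a small helper. This is only a sketch: getFinalURL and its behaviour (returning the input URL when there is no redirect, and returning relative Location values as-is) are assumptions for illustration, not part of the original answer.

getFinalURL <- function(url) {
  headers <- tryCatch(curlGetHeaders(url), error = function(e) character(0))
  # header names may be lower-case (e.g. over HTTP/2), hence ignore.case
  locations <- grep("^location: ", headers, value = TRUE, ignore.case = TRUE)
  if (length(locations) == 0) return(url)  # no redirect: URL is already final
  # take the last hop, strip the header name and the trailing \r\n
  trimws(sub("^location: ", "", locations[length(locations)], ignore.case = TRUE))
}

urlRedirects <- vapply(as.character(urlList), getFinalURL, character(1))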
Upvotes: 2