Tim496

Reputation: 162

Capturing redirected URLs in R

Suppose I have an R script and a long list of URLs (100k+). What is the most efficient way to capture the final redirected URL for each one?

For example: if you request "www.someurl.com" and it redirects to "www.someurl.com/homepage", I'd like to record that final URL in a data frame.

I tried using the HEAD function from httr, but it didn't seem to give me what I wanted, i.e.:

library(httr)

getCanonicalURLs <- function(url) {
  canonicalURL <- HEAD(url)  # returns the full response object, not just the final URL
}

urlRedirects <- lapply(as.character(urlList), getCanonicalURLs)

Upvotes: 1

Views: 1331

Answers (2)

ishonest

Reputation: 493

Here's a function that checks the HTTP status (expects 200) and returns the redirected URLs. I wanted something that works with dplyr::mutate, and this function does.

getCanonicalURLs <- function(urls) {
  op <- rep(NA_character_, length(urls))
  for (i in seq_along(urls)) {
    op[i] <- tryCatch(
      # crul::ok() returns TRUE for a 200 status; only then fetch the final URL
      if (crul::ok(urls[i], info = FALSE)) httr::HEAD(urls[i])$url else NA_character_,
      error = function(e) NA_character_,
      warning = function(w) NA_character_
    )
  }
  op
}
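For instance, usage inside dplyr::mutate might look like this (a minimal sketch; the data frame df and its url column are assumptions for illustration):

library(dplyr)

# df is a hypothetical data frame with a character column `url`
df <- tibble(url = c("http://www.someurl.com", "http://www.ard.de"))

# getCanonicalURLs is vectorised, so it can be used directly in mutate()
df <- df %>%
  mutate(canonical_url = getCanonicalURLs(url))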

Upvotes: 0

Martin Schmelzer

Reputation: 23909

I think you can go with base::curlGetHeaders():

curlGetHeaders("www.ard.de")
 [1] "HTTP/1.1 301 Moved Permanently\r\n"                                      
 [2] "Server: Apache\r\n"                                                      
 [3] "Location: http://www.ard.de/home/ard/ARD_Startseite/21920/index.html\r\n"
 [4] "Content-Length: 328\r\n"
 ...   

Then just get the element that starts with "Location":

stringr::str_extract(
  grep("Location", curlGetHeaders("www.ard.de"), value = TRUE),
  pattern = "http://.*"
)
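For the full 100k+ list, a minimal sketch that wraps this into a helper and guards each lookup with tryCatch (the getFinalURL name and the urlList vector are assumptions, not from the original):

# Hypothetical helper: returns the URL from the last Location header, or NA on failure
getFinalURL <- function(url) {
  tryCatch({
    headers <- curlGetHeaders(url)
    loc <- grep("^Location: ", headers, value = TRUE)
    if (length(loc) == 0) return(NA_character_)
    # take the last Location (final hop), strip the prefix and trailing "\r\n"
    sub("\r\n$", "", sub("^Location: ", "", loc[length(loc)]))
  }, error = function(e) NA_character_)
}

# urlList is assumed to be a character vector of URLs
urlRedirects <- vapply(urlList, getFinalURL, character(1))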

Upvotes: 2
