Reputation: 10441
library(rvest)
this_page = read_html("https://apu.edu/athletics")
> this_page
{xml_document}
<html id="ctl00_html" lang="en" class=" index homepage">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<script>window.client_hostname = "athletics.apu.edu";window.server_name = "79077 ...
[2] <body>\n<div style="position: fixed; left: -10000px"><script src="//cdn.blueconic.net/azusa.js" async=""></script></div>\n<script>(function(i,s,o,g,r,a,m){i[ ...
Although we read https://apu.edu/athletics, it redirects to athletics.apu.edu. This is visible both in the browser and in the output of this_page, right here: <script>window.client_hostname = "athletics.apu.edu"; ...
Is it possible to extract this value from the this_page variable?
Edit: the three current top answers (ekoam, David, Allan) all work, and all take about the same amount of time (0.35 seconds). I've accepted the answer with trace_redirects because it provides the additional information of all the redirects.
Upvotes: 1
Views: 499
Reputation: 174293
If you want to get all the redirects (you are actually redirected more than once here), you can use this function:
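library(magrittr)  # provides the %>% pipe

trace_redirects <- function(url) {
  # Collect the Location header of every response in the redirect chain
  httr::GET(url)$all_headers %>%
    lapply(function(x) x$headers$location) %>%
    unlist() %>%
    unique()
}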
So you can do:
trace_redirects("https://apu.edu/athletics")
#> [1] "https://www.apu.edu/athletics" "http://athletics.apu.edu"
#> [3] "https://athletics.apu.edu/"
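If the status code of each hop is also of interest, the same idea can be sketched as a small table builder. This is only an illustration: redirect_table is a hypothetical helper name, and it assumes its input has the shape of httr::GET(url)$all_headers, that is, a list whose elements each carry a status and a headers entry.

```r
# Hypothetical helper: one row per response in the redirect chain.
# Assumes `all_headers` is shaped like httr::GET(url)$all_headers.
redirect_table <- function(all_headers) {
  data.frame(
    status = vapply(all_headers, function(x) as.integer(x$status), integer(1)),
    location = vapply(
      all_headers,
      function(x) {
        loc <- x$headers$location
        # The final response normally has no Location header
        if (is.null(loc)) NA_character_ else loc
      },
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

# Usage (needs network access):
# redirect_table(httr::GET("https://apu.edu/athletics")$all_headers)
```

Taking the header list as an argument, rather than the URL, keeps the helper free of network calls, so it can be tried on a hand-built list first.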
Upvotes: 1
Reputation: 8844
If you don't mind using httr for this, then just:
httr::GET("https://apu.edu/athletics")[["url"]]
#> [1] "https://athletics.apu.edu/"
Upvotes: 4
Reputation: 10222
If you use html_session() instead, it should work (in rvest 1.0.0 and later, html_session() is deprecated in favor of session()):
library(rvest)
url <- "https://apu.edu/athletics"
s <- html_session(url)
s
#> <session> https://athletics.apu.edu/
#> Status: 200
#> Type: text/html; charset=utf-8
#> Size: 221620
s$url
#> [1] "https://athletics.apu.edu/"
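Coming back to the literal question, the value can probably also be pulled out of the already parsed this_page, since the script text shown in the question contains the assignment window.client_hostname = "...". The following is a minimal sketch under that assumption. extract_client_hostname is a hypothetical helper name, and the approach depends on the page continuing to embed that variable, so it is more fragile than asking the HTTP client for the final URL.

```r
# Hypothetical helper: find `window.client_hostname = "..."` in script text.
# Returns NA_character_ if the assignment is absent.
extract_client_hostname <- function(script_text) {
  pattern <- 'window\\.client_hostname[[:space:]]*=[[:space:]]*"([^"]*)"'
  hits <- regmatches(script_text, regexec(pattern, script_text))
  # Each hit is c(full match, captured hostname); keep only the capture
  hosts <- vapply(
    hits,
    function(h) if (length(h) == 2) h[2] else NA_character_,
    character(1)
  )
  hosts <- hosts[!is.na(hosts)]
  if (length(hosts) > 0) hosts[1] else NA_character_
}

# Usage with the parsed page from the question:
# scripts <- rvest::html_text(rvest::html_nodes(this_page, "script"))
# extract_client_hostname(scripts)
```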
Upvotes: 1