Reputation: 89
I am trying to scrape all event links from https://www.tapology.com/fightcenter. I already have quite some experience with web scraping in R, but in this case I am stuck.
I am able to scrape page 1; however, when I request the second page's URL, I still obtain the data from the first page, as if I were being redirected back automatically.
I have tried various snippets found here on the forum, but something is still wrong.
First page
url = "https://www.tapology.com/fightcenter"
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched = as.data.frame(matched[[1]], stringsAsFactors = F)
Second page
library(stringr)

url <- "https://www.tapology.com/fightcenter_events?page=2"
html <- paste(readLines(url), collapse = "\n")

matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched <- as.data.frame(matched[[1]], stringsAsFactors = FALSE)
The results are identical. Could you please help me solve this?
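For what it's worth, here is a quick way to see where the request for page 2 actually ends up (just a small sketch using httr; I have not confirmed this is the cause):

library(httr)

# Request page 2 and inspect where the response actually comes from
resp <- GET("https://www.tapology.com/fightcenter_events?page=2")
status_code(resp)  # HTTP status of the final response
resp$url           # final URL after any redirects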
Thank you
Upvotes: 0
Views: 108
Reputation: 2213
I have been able to extract the first three links of the first three pages with the following code:
library(RSelenium)

# Start a standalone Firefox Selenium server in Docker and connect to it
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.tapology.com/fightcenter")

list_Matched <- list()

# We get results from the first 3 pages
for(i in 1:3)
{
  print(i)

  if(i != 1)
  {
    # Press the "next" button to move to the following page
    web_Elem_Link <- remDr$findElement("class name", "next")
    web_Elem_Link$clickElement()
  }

  list_Link_Page <- list()
  Sys.sleep(3)

  # Get the first three links of the page ...
  for(j in 1:3)
  {
    # The section index in the XPath can vary; fall back to the alternative if the first fails
    web_Elem_Link <- tryCatch(remDr$findElement("xpath", paste0('//*[@id="content"]/div[4]/section[', j, ']/div/div[1]/div[1]/span[1]/a')),
                              error = function(e) NA)

    if(is.na(web_Elem_Link))
    {
      web_Elem_Link <- remDr$findElement("xpath", paste0('//*[@id="content"]/div[3]/section[', j, ']/div/div[1]/div[1]/span[1]/a'))
    }

    # Follow the link, record the resulting URL, then go back to the listing
    web_Elem_Link$clickElement()
    Sys.sleep(3)
    list_Link_Page[[j]] <- remDr$getCurrentUrl()
    remDr$goBack()
    Sys.sleep(3)
  }

  list_Matched[[i]] <- list_Link_Page
}
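The result, list_Matched, is a nested list of URLs (one sub-list per page). If a flat character vector is more convenient, it could be collapsed like this (a small sketch, not part of the code above):

# Collapse the nested list of URLs into one character vector
all_urls <- unlist(list_Matched)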
Upvotes: 0
Reputation: 84465
Content is added dynamically via XHR. You can use httr (as mentioned in the other answer) and add your headers. You also need to alter the page param that goes in the URL during a loop/sequence. An example of a single request for a different page is shown below (I just extract the fight links of person 1 vs person 2 to show it is reading from that page). You could alter this to be a function returning the info of interest in your loop, or perhaps use purrr to map the info across to an existing structure; a sketch of that follows the example.
require(httr)
require(rvest)
require(magrittr)
require(stringr)

# Headers that make the request look like the site's own XHR call
headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
  'X-Requested-With' = 'XMLHttpRequest'
)

# The page number goes in the query string
params = list(
  'page' = '2'
)

r <- httr::GET(url = 'https://www.tapology.com/fightcenter_events', httr::add_headers(.headers = headers), query = params)

# The response is jQuery code; grab the html("...") payload, unescape it, then parse it
x <- str_match_all(content(r, as = "text"), 'html\\("(.*>)')
y <- gsub('"', "'", gsub('\\\\', '', x[[1]][, 2]))
z <- read_html(y) %>% html_nodes(., ".billing a") %>% html_attr(., "href")
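As mentioned above, the single request could be wrapped in a function and mapped over a sequence of pages. A minimal sketch of that idea (the get_event_links name, the 1:3 page range, and the one-second pause are my own choices):

require(httr)
require(rvest)
require(magrittr)
require(stringr)

# Fetch one fightcenter page via the XHR endpoint and return the event links on it
get_event_links <- function(page) {
  headers = c(
    'User-Agent' = 'Mozilla/5.0',
    'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'X-Requested-With' = 'XMLHttpRequest'
  )
  r <- httr::GET(url = 'https://www.tapology.com/fightcenter_events',
                 httr::add_headers(.headers = headers),
                 query = list(page = as.character(page)))
  # The response is jQuery code; extract the html("...") payload, unescape it, then parse it
  x <- str_match_all(content(r, as = "text"), 'html\\("(.*>)')
  y <- gsub('"', "'", gsub('\\\\', '', x[[1]][, 2]))
  read_html(y) %>% html_nodes(".billing a") %>% html_attr("href")
}

# Map over the first three pages, pausing briefly between requests
all_links <- lapply(1:3, function(p) { Sys.sleep(1); get_event_links(p) })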
Upvotes: 1
Reputation: 1091
You're getting redirected back because the website checks the headers you are sending. To get the correct data, you need to set these headers:

Accept: text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01
X-Requested-With: XMLHttpRequest

Also, this request doesn't return the HTML of the webpage, but jQuery code which updates the list on the website dynamically.
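A minimal way to set those two headers from R with httr could look like this (a sketch along the lines of the other answer):

library(httr)

# Send the two headers above so the site returns the page-2 data instead of redirecting
resp <- GET(
  'https://www.tapology.com/fightcenter_events',
  add_headers(
    'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'X-Requested-With' = 'XMLHttpRequest'
  ),
  query = list(page = '2')
)

# The body is jQuery/JavaScript that updates the list, not plain HTML
substr(content(resp, as = "text"), 1, 200)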
Upvotes: 0