Reputation: 89
I am trying to scrape all event links from https://www.tapology.com/fightcenter. I already have quite some experience with web scraping in R, but in this case I am stuck.
I am able to scrape page 1; however, when I request the second page's URL, I still obtain the data from the first page, as if I were being redirected back automatically.
I have tried various snippets found here on the forum, but something is still wrong.
First page
url = "https://www.tapology.com/fightcenter"
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched = as.data.frame(matched[[1]], stringsAsFactors = F)
Second page
library(stringr)

url <- "https://www.tapology.com/fightcenter_events?page=2"
html <- paste(readLines(url), collapse = "\n")

matched <- str_match_all(html, "<a href=\"(.*?)\"")
matched <- as.data.frame(matched[[1]], stringsAsFactors = FALSE)
The results are identical. Could you please help me solve this?
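For what it's worth, here is a quick way to see where the request for page 2 actually ends up (just a small sketch using httr; I have not confirmed this is the cause):

library(httr)

# Request page 2 and inspect where the response actually comes from
resp <- GET("https://www.tapology.com/fightcenter_events?page=2")
status_code(resp)  # HTTP status of the final response
resp$url           # final URL after any redirects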
Thank you
Upvotes: 0
Views: 108
Reputation: 2213
I have been able to extract the first three links of the first three pages with the following code:
library(RSelenium)

# Start a standalone Firefox Selenium server in Docker and connect to it
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.tapology.com/fightcenter")

list_Matched <- list()

# We get results from the first 3 pages
for(i in 1:3)
{
  print(i)

  if(i != 1)
  {
    # Press the "next" button to move to the following page
    web_Elem_Link <- remDr$findElement("class name", "next")
    web_Elem_Link$clickElement()
  }

  list_Link_Page <- list()
  Sys.sleep(3)

  # Get the first three links of the page ...
  for(j in 1:3)
  {
    # The section index in the XPath can vary; fall back to the alternative if the first fails
    web_Elem_Link <- tryCatch(remDr$findElement("xpath", paste0('//*[@id="content"]/div[4]/section[', j, ']/div/div[1]/div[1]/span[1]/a')),
                              error = function(e) NA)

    if(is.na(web_Elem_Link))
    {
      web_Elem_Link <- remDr$findElement("xpath", paste0('//*[@id="content"]/div[3]/section[', j, ']/div/div[1]/div[1]/span[1]/a'))
    }

    # Follow the link, record the resulting URL, then go back to the listing
    web_Elem_Link$clickElement()
    Sys.sleep(3)
    list_Link_Page[[j]] <- remDr$getCurrentUrl()
    remDr$goBack()
    Sys.sleep(3)
  }

  list_Matched[[i]] <- list_Link_Page
}
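The result, list_Matched, is a nested list of URLs (one sub-list per page). If a flat character vector is more convenient, it could be collapsed like this (a small sketch, not part of the code above):

# Collapse the nested list of URLs into one character vector
all_urls <- unlist(list_Matched)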
Upvotes: 0
Reputation: 84465
Content is added dynamically via XHR. You can use httr (as mentioned in the other answer) and add your headers. You also need to alter the page param that goes in the URL during a loop/sequence. An example of a single request for a different page is shown below (I just extract the fight links of person 1 vs person 2 to show it is reading from that page). You could alter this to be a function returning the info of interest in your loop, or perhaps use purrr to map the info across to an existing structure; a sketch of that follows the example.
require(httr)
require(rvest)
require(magrittr)
require(stringr)

# Headers that make the request look like the site's own XHR call
headers = c(
  'User-Agent' = 'Mozilla/5.0',
  'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
  'X-Requested-With' = 'XMLHttpRequest'
)

# The page number goes in the query string
params = list(
  'page' = '2'
)

r <- httr::GET(url = 'https://www.tapology.com/fightcenter_events', httr::add_headers(.headers = headers), query = params)

# The response is jQuery code; grab the html("...") payload, unescape it, then parse it
x <- str_match_all(content(r, as = "text"), 'html\\("(.*>)')
y <- gsub('"', "'", gsub('\\\\', '', x[[1]][, 2]))
z <- read_html(y) %>% html_nodes(., ".billing a") %>% html_attr(., "href")
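As mentioned above, the single request could be wrapped in a function and mapped over a sequence of pages. A minimal sketch of that idea (the get_event_links name, the 1:3 page range, and the one-second pause are my own choices):

require(httr)
require(rvest)
require(magrittr)
require(stringr)

# Fetch one fightcenter page via the XHR endpoint and return the event links on it
get_event_links <- function(page) {
  headers = c(
    'User-Agent' = 'Mozilla/5.0',
    'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'X-Requested-With' = 'XMLHttpRequest'
  )
  r <- httr::GET(url = 'https://www.tapology.com/fightcenter_events',
                 httr::add_headers(.headers = headers),
                 query = list(page = as.character(page)))
  # The response is jQuery code; extract the html("...") payload, unescape it, then parse it
  x <- str_match_all(content(r, as = "text"), 'html\\("(.*>)')
  y <- gsub('"', "'", gsub('\\\\', '', x[[1]][, 2]))
  read_html(y) %>% html_nodes(".billing a") %>% html_attr("href")
}

# Map over the first three pages, pausing briefly between requests
all_links <- lapply(1:3, function(p) { Sys.sleep(1); get_event_links(p) })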
Upvotes: 1
Reputation: 1091
You're getting redirected back because the website checks the headers you are sending. To get the correct data, you need to set these headers:

Accept: text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01
X-Requested-With: XMLHttpRequest

Also, this request doesn't return the HTML of the webpage, but jQuery code which updates the list on the website dynamically.
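A minimal way to set those two headers from R with httr could look like this (a sketch along the lines of the other answer):

library(httr)

# Send the two headers above so the site returns the page-2 data instead of redirecting
resp <- GET(
  'https://www.tapology.com/fightcenter_events',
  add_headers(
    'Accept' = 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    'X-Requested-With' = 'XMLHttpRequest'
  ),
  query = list(page = '2')
)

# The body is jQuery/JavaScript that updates the list, not plain HTML
substr(content(resp, as = "text"), 1, 200)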
Upvotes: 0