Clem

Reputation: 55

Web-Scraping - multiple pages with R

I need to scrape HTML tables from the web using R. There is one table of 1,000 rows per page, and there are 316 pages in total. The URL of the first page is: " http://sumodb.sumogames.de/Query.aspx?show_form=0&columns=6&rowcount=5&showheya=on&showshusshin=on&showbirthdate=on&showhatsu=on&showintai=on&showheight=on&showweight=on&showhighest=on "

I think only the offset is incremented (1000, 2000, 3000, ..., 316000) in the other URLs.
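If that is right, all 316 page URLs could be built up front instead of discovered by paging. A minimal sketch (the parameter name offset and the exact range are my guesses, not verified against the site):

    # Hypothetical: build all 316 page URLs by appending an "offset" parameter
    baseURL <- "http://sumodb.sumogames.de/Query.aspx?show_form=0&columns=6&rowcount=5&showheya=on&showshusshin=on&showbirthdate=on&showhatsu=on&showintai=on&showheight=on&showweight=on&showhighest=on"
    offsets <- seq(1000, 315000, by = 1000)   # pages 2 to 316; exact upper bound is a guess
    urls <- c(baseURL, paste0(baseURL, "&offset=", offsets))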

This is my code so far, which works for one page:

    library(XML)
    library(rvest)

    url <- read_html("http://sumodb.sumogames.de/Query.aspx?show_form=0&columns=6&rowcount=5&showheya=on&showshusshin=on&showbirthdate=on&showhatsu=on&showintai=on&showheight=on&showweight=on&showhighest=on")

    table <- url %>%
      html_nodes(".record") %>%
      html_table(fill = TRUE)
    table

The CSS selector for the big table on each page is ".record".

The final aim is to have the entire table in one CSV file.
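Put together, something like this is what I am picturing (untested; it relies on the guessed urls vector above and assumes every page keeps the ".record" table):

    # Untested sketch: scrape each page's .record table and combine into one CSV
    allTables <- lapply(urls, function(u) {
      page <- read_html(u)
      tab <- page %>%
        html_nodes(".record") %>%
        html_table(fill = TRUE)
      tab[[1]]
    })
    # Note: if each page repeats a header row inside the table, it will need removing
    bigTable <- do.call(rbind, allTables)
    write.csv(bigTable, "sumo.csv", row.names = FALSE)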

Upvotes: 1

Views: 1375

Answers (1)

Frost_Maggot

Reputation: 309

The following code should achieve what you are after, but be warned that it will take a very long time, because every single page of the web-based query is slow to load.

The code uses the next, previous, and last buttons to cycle through the pages. One caveat: the first two and final two pages have different CSS selectors, so they are handled manually.

The .txt file will need tidying up after completion.

    library(XML)
    library(rvest)

    # Starting page URL
    url <- read_html("http://sumodb.sumogames.de/Query.aspx?show_form=0&columns=6&rowcount=5&showheya=on&showshusshin=on&showbirthdate=on&showhatsu=on&showintai=on&showheight=on&showweight=on&showhighest=on")

    # URL prefix
    urlPrefix <- "http://sumodb.sumogames.de/"

    # URL of last page
    lastURL <- url %>%
      html_nodes('div+ div a+ a') %>%
      html_attr("href")
    lastURL <- paste0(urlPrefix, lastURL)
    lastURL <- read_html(lastURL)

    # URL of second to last page
    penultimateURL <- lastURL %>%
      html_nodes('div+ div a+ a') %>%
      html_attr("href")
    penultimateURL <- paste0(urlPrefix, penultimateURL)
    penultimateURL <- read_html(penultimateURL)

    # Table of first page
    tabletemp <- url %>%
      html_nodes(".record") %>%
      html_table(fill = TRUE)
    tabletemp <- tabletemp[[1]]
    names(tabletemp) <- tabletemp[1, ]
    tabletemp <- tabletemp[-1, ]

    # Create and write first table to a .txt file
    write.table(tabletemp, 'table.txt', row.names = FALSE)

    # URL of second page
    nextURL <- url %>%
      html_nodes('div+ div a:nth-child(1)') %>%
      html_attr("href")
    nextURL <- paste0(urlPrefix, nextURL)
    nextURL <- read_html(nextURL)

    # Table of second page
    tabletemp <- nextURL %>%
      html_nodes(".record") %>%
      html_table(fill = TRUE)
    tabletemp <- tabletemp[[1]]
    names(tabletemp) <- tabletemp[1, ]
    tabletemp <- tabletemp[-1, ]

    # Append second table to .txt file
    write.table(tabletemp, 'table.txt', row.names = FALSE, col.names = FALSE, append = TRUE)

    # URL of third page
    nextURL <- nextURL %>%
      html_nodes('div+ div a:nth-child(2)') %>%
      html_attr("href")
    nextURL <- paste0(urlPrefix, nextURL)
    nextURL <- read_html(nextURL)

    # Cycle through pages 3 to N - 2, comparing each new page's text to the
    # penultimate page's text to know when to stop
    while(html_text(nextURL) != html_text(penultimateURL)){

      tabletemp <- nextURL %>%
        html_nodes(".record") %>%
        html_table(fill = TRUE)
      tabletemp <- tabletemp[[1]]
      names(tabletemp) <- tabletemp[1, ]
      tabletemp <- tabletemp[-1, ]

      write.table(tabletemp, 'table.txt', row.names = FALSE, col.names = FALSE, append = TRUE)

      nextURL <- nextURL %>%
        html_nodes('div+ div a:nth-child(3)') %>%
        html_attr("href")
      nextURL <- paste0(urlPrefix, nextURL)
      nextURL <- read_html(nextURL)

    }

    # Table of penultimate page
    tabletemp <- penultimateURL %>%
      html_nodes(".record") %>%
      html_table(fill = TRUE)
    tabletemp <- tabletemp[[1]]
    names(tabletemp) <- tabletemp[1, ]
    tabletemp <- tabletemp[-1, ]

    # Append penultimate table to .txt file
    write.table(tabletemp, 'table.txt', row.names = FALSE, col.names = FALSE, append = TRUE)

    # Table of last page
    tabletemp <- lastURL %>%
      html_nodes(".record") %>%
      html_table(fill = TRUE)
    tabletemp <- tabletemp[[1]]
    names(tabletemp) <- tabletemp[1, ]
    tabletemp <- tabletemp[-1, ]

    # Append last table to .txt file
    write.table(tabletemp, 'table.txt', row.names = FALSE, col.names = FALSE, append = TRUE)

    # Checking number of rows in final table
    nrow(read.table('table.txt'))
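
Since the final aim is a single CSV file, the finished .txt can then be re-read and written out as CSV; a small sketch, assuming the write.table defaults used above (run it after any tidying the file needs):

    # Convert the accumulated .txt file to the final CSV
    finalTable <- read.table('table.txt', header = TRUE)
    write.csv(finalTable, 'table.csv', row.names = FALSE)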

If you want the code to run quicker for testing purposes, try starting at the fifth-to-last page or something similar; just be aware that the CSS selectors will have to be changed for the first and second pages.

I hope this helps :)

Upvotes: 1
