Michael

Reputation: 11

Webscrape multiple tables with R (rvest)

I'm trying to scrape all the tables on the Wikipedia page for CSI: https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes. So far I've been able to scrape one table (season 1) with the code below. Is there a for loop that could go through all the tables, since they have the same class?

Here is my R code:

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes"
episodes <- url %>%
  read_html() %>%
  html_nodes('#mw-content-text > div > table:nth-child(14)') %>%
  html_table()
episodes <- episodes[[1]]

Update: I just realised that each table has a different nth-child selector, so I decided to assign each table's selector to a variable, as shown below. Can I now loop through each table and assign the results to one data frame/variable, "episodes"? Adjusted code:

library(rvest)  # provides read_html(), html_nodes(), html_table()
library(dplyr)
library(purrr)
url <- "https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes"
table1<- '#mw-content-text > div > table:nth-child(14)'
table2<- '#mw-content-text > div > table:nth-child(18)'
table3<- '#mw-content-text > div > table:nth-child(22)'
table4<- '#mw-content-text > div > table:nth-child(26)'
table5<- '#mw-content-text > div > table:nth-child(30)'
table6<- '#mw-content-text > div > table:nth-child(34)'
table7<- '#mw-content-text > div > table:nth-child(38)'
table8<- '#mw-content-text > div > table:nth-child(42)'
table9<- '#mw-content-text > div > table:nth-child(46)'
table10<- '#mw-content-text > div > table:nth-child(50)'
table11<- '#mw-content-text > div > table:nth-child(54)'
table12<- '#mw-content-text > div > table:nth-child(58)'
table13<- '#mw-content-text > div > table:nth-child(62)'
table14<- '#mw-content-text > div > table:nth-child(66)'
table15<- '#mw-content-text > div > table:nth-child(70)'
table16<- '#mw-content-text > div > table:nth-child(74)'
#table17<- '#mw-content-text > div > table:nth-child(79)'
episodes <- url %>%
  read_html() %>%
  html_nodes(table1) %>%
  html_table(fill = TRUE)
episodes <- episodes[[1]]

# Write the scraped table to disk (the object is called "episodes", not "population")
write.csv(episodes, file = "test.csv")

Upvotes: 1

Views: 1180

Answers (1)

giocomai

Reputation: 3528

If I understand it correctly, what you are trying to do is to put in a single data frame all tables except the first one, which lists the seasons and has different column names.

Assuming you have installed purrr and dplyr (both part of the tidyverse), the following should achieve what you want: first extract all tables, then put all of them (bar the first) in a single data frame.

library(rvest)

url <- "https://en.wikipedia.org/wiki/List_of_CSI:_Crime_Scene_Investigation_episodes"

episodes <- url %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)

purrr::map_dfr(episodes[-1], dplyr::bind_rows)

To clarify, the first pipe of commands creates a list of data frames with all tables.
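As a minimal offline sketch of that first step, the snippet below parses a small made-up HTML string instead of the Wikipedia page. The markup, table contents, and the names `page` and `tables` are all hypothetical; only `read_html()`, `html_nodes()` and `html_table()` come from the code above.

```r
library(rvest)

# Hypothetical two-table page, parsed from a literal string so it runs offline.
page <- read_html('
  <html><body>
    <table class="wikitable">
      <tr><th>Season</th><th>Episodes</th></tr>
      <tr><td>1</td><td>23</td></tr>
    </table>
    <table class="wikitable">
      <tr><th>No.</th><th>Title</th></tr>
      <tr><td>1</td><td>Pilot</td></tr>
      <tr><td>2</td><td>Cool Change</td></tr>
    </table>
  </body></html>
')

# Selecting by tag name returns every table; html_table() then yields
# one data frame per table, collected in a list.
tables <- page %>%
  html_nodes("table") %>%
  html_table()

length(tables)       # number of tables found
names(tables[[2]])   # column names of the second table
```

Because the result is an ordinary list, `tables[-1]` drops the first (overview) table in the same way `episodes[-1]` does above.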

map_dfr iterates over the given list, applies the function to each element, and row-binds the results into a single data frame.
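If you also want to know which table each row came from, one possible variation (not part of the code above) is to call `dplyr::bind_rows()` directly on the list with its `.id` argument. This is a sketch on made-up data: `season_tables` and `all_episodes` are hypothetical names, and the tables are placeholders for `episodes[-1]`.

```r
library(dplyr)

# Hypothetical per-season tables standing in for episodes[-1].
season_tables <- list(
  data.frame(Title = c("Pilot", "Cool Change"), stringsAsFactors = FALSE),
  data.frame(Title = "Burked", stringsAsFactors = FALSE)
)

# bind_rows() accepts a list of data frames; with .id it adds a column
# recording the position (or name) of the source table for each row.
all_episodes <- bind_rows(season_tables, .id = "season")

all_episodes
```

On the real page this may fail if a column with the same name is parsed as different types in different tables (for example, integer in one and character in another); in that case converting the affected columns to character before binding is one way around it.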

Upvotes: 3
