seansteele

Reputation: 683

Web scraping: Combining tables in for-loop in R

I'm using a loop to scrape tables from a website, but I can't figure out how to combine them into one data frame. The code below correctly scrapes the relevant table from a single page, but I'm not sure how to append each new table to the first one (or to a preexisting one). Thanks.

for (i in 1:10){
  
    link <- paste0("https://website.com/page",i)
    remDr$navigate(link)  

    # grab the html
    pg <- remDr$getPageSource() %>% .[[1]] %>%
         read_html()

    #grab the correct table
    table <- pg %>%
            html_nodes("table") %>%
            .[2] %>%
            html_table(fill = TRUE) %>%
            .[[1]] 
    
    # combine tables?
  
}

Upvotes: 0

Views: 170

Answers (1)

Count Orlok

Reputation: 1007

If you want to keep the loop, declare a data frame before the loop body, and keep adding to it at every iteration using rbind:

# assumes rvest is loaded and remDr is an open RSelenium remote driver, as in the question
big_df <- data.frame()

for (i in 1:10) {

  link <- paste0("https://website.com/page", i)
  remDr$navigate(link)

  # grab the html
  pg <- remDr$getPageSource() %>%
    .[[1]] %>%
    read_html()

  # grab the correct table (the second <table> on the page)
  table <- pg %>%
    html_nodes("table") %>%
    .[2] %>%
    html_table(fill = TRUE) %>%
    .[[1]]

  # append this page's table to the accumulated data frame
  big_df <- rbind(big_df, table)
}
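
To see the accumulate-with-rbind pattern in isolation, here is a minimal sketch that runs without a browser session. fake_page_table() is a hypothetical stand-in for the scraped table and is not part of the original code:

```r
# Hypothetical stand-in for the table scraped from page i (an assumption for illustration).
fake_page_table <- function(i) {
  data.frame(page = i, value = c(i * 10, i * 10 + 1))
}

# Start from an empty data frame; rbind() ignores it when binding the first real table.
big_df <- data.frame()

for (i in 1:3) {
  # Append this page's rows to everything collected so far.
  big_df <- rbind(big_df, fake_page_table(i))
}

nrow(big_df)   # two rows per page, three pages
```

This only works cleanly when every page's table has the same column names, since rbind() on data frames matches columns by name.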

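A better (and faster) way is to put the loop body in a function, lapply it over 1:10 to get a list of data frames, and then bind them all in one call with data.table::rbindlist. Calling rbind inside the loop copies the accumulated data frame on every iteration, whereas this binds once at the end. Note that rbindlist returns a data.table; wrap the result in as.data.frame() if you need a plain data frame: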

df_list <- lapply(1:10, function(i) {

  link <- paste0("https://website.com/page", i)
  remDr$navigate(link)

  # grab the html
  pg <- remDr$getPageSource() %>%
    .[[1]] %>%
    read_html()

  # grab the correct table
  table <- pg %>%
    html_nodes("table") %>%
    .[2] %>%
    html_table(fill = TRUE) %>%
    .[[1]]

  return(table)
})

big_df <- data.table::rbindlist(df_list)
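
For comparison, here is a base R sketch of the same build-a-list-then-bind-once pattern, assuming you would rather not depend on data.table. It swaps in do.call(rbind, ...) for rbindlist, and fake_page_table() is again a hypothetical stand-in for the scraping code:

```r
# Hypothetical stand-in for the table scraped from page i (an assumption for illustration).
fake_page_table <- function(i) {
  data.frame(page = i, value = c(i * 10, i * 10 + 1))
}

# Build one data frame per page and collect them in a list.
df_list <- lapply(1:3, fake_page_table)

# Bind all list elements in a single call instead of growing a data frame in a loop.
big_df <- do.call(rbind, df_list)

class(big_df)  # a plain data.frame, not a data.table
```

do.call(rbind, df_list) is typically slower than rbindlist on long lists, but it needs no extra package and returns an ordinary data frame.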

Upvotes: 1
