The_maestro3

Reputation: 9

Web scraping across multiple pages in R

I have been working on some R code whose purpose is to collect the average word length and other statistics about the words in a section of a website with 50 pages. Collecting the stats is the easy part. The hard part is getting my code to collect them across all 50 pages: it only ever seems to output information from the first page. See the code below, and please ignore the poor indentation.

install.packages(c('tidytext', 'tidyverse'))

   library(tidyverse)
   library(tidytext)
   library(rvest)
   library(stringr)

   websitePage <- read_html('http://books.toscrape.com/catalogue/page-1.html')
   textSort <- websitePage %>%
   html_nodes('.product_pod a') %>%
   html_text()


   for (page_result in seq(from = 1, to = 50, by = 1)) {
      link = paste0('http://books.toscrape.com/catalogue/page-',page_result,'.html')

      page = read_html(link)

     # Creates a tibble
      textSort.tbl <- tibble(text = textSort)

      textSort.tidy <- textSort.tbl %>%
      unnest_tokens(word, text)

  }

   # Finds the average word length
    textSort.tidy %>%
      map(nchar) %>%
      map(mean)

   # Finds the most common words
    textSort.tidy %>%
    count(word, sort = TRUE)

    # Removes the stop words and then finds most common words
     textSort.tidy %>%
     anti_join(stop_words) %>%
     count(word, sort = TRUE)

    # Counts the number of times the word "Girl" is in the text
      textSort.tidy %>%
      count(word) %>%
      filter(word == "Girl")

Upvotes: 0

Views: 309

Answers (1)

Ronak Shah

Reputation: 388962

You can use lapply/map to extract the text from multiple links.

library(rvest)

link <- paste0('http://books.toscrape.com/catalogue/page-',1:50,'.html')

result <- lapply(link, function(x) x %>% 
                          read_html %>% 
                          html_nodes('.product_pod a') %>%
                          html_text)

You can continue using lapply if you want to apply other functions to the text.
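For context, the loop in the question builds `textSort` once from page 1 before the loop starts and never uses the `page` it reads inside the loop, which would explain why only first-page results come out. As a rough sketch of how the statistics from the question could be computed over every page, the per-page vectors can be flattened into a single tibble before tokenizing. The `result` list below is a made-up stand-in so the snippet runs without network access; in practice it would be the output of the `lapply()` call above. The names `words`, `avg_word_length`, `word_counts`, `content_counts`, and `girl_count` are illustrative only.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Stand-in for `result` from the lapply() call: one character vector per page.
# These titles are hypothetical sample data, not actual scraped output.
result <- list(
  c("A Light in the Attic", "Tipping the Velvet"),
  c("The Girl on the Train", "Girl in the Dark")
)

# Flatten all pages into one tibble, then split the text into one word per row.
words <- tibble(text = unlist(result)) %>%
  unnest_tokens(word, text)

# Average word length across all pages
avg_word_length <- mean(nchar(words$word))

# Most common words across all pages
word_counts <- words %>%
  count(word, sort = TRUE)

# Most common words after removing stop words
content_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# unnest_tokens() lowercases tokens by default, so compare against "girl"
# rather than "Girl".
girl_count <- sum(words$word == "girl")
```

One thing to watch for: because `unnest_tokens()` lowercases by default, a filter such as `word == "Girl"` would match nothing, so the comparison here uses the lowercase form.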

Upvotes: 1
