Reputation: 25
I have just started with web scraping in R, and I am having trouble figuring out how to scrape specific information from a website with several pages without having to run the code for each individual URL. So far I have managed to do it for the first page using this example: https://towardsdatascience.com/tidy-web-scraping-in-r-tutorial-and-resources-ac9f72b4fe47.
I have also managed to generate the URLs based on the page number with this code:
library(stringr)  # for str_c()
list_of_pages <- str_c(url, '?page=', 1:32)
The problem is integrating the two: using the generated URLs with one function to get the information I need and store it in a data frame. This is the code I have for scraping the information:
library(rvest)  # read_html(), html_nodes(), html_text(); also re-exports %>%

hot100page <- "https://www.billboard.com/charts/hot-100"
hot100 <- read_html(hot100page)

rank <- hot100 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>%
  rvest::html_text()
This is an example of the structure of the website I plan to use the function for: https://www.amazon.com/s?k=statistics&ref=nb_sb_noss_2.
Upvotes: 2
Views: 136
Reputation: 52268
Here's a way to do it using rvest. Keep in mind that this particular website (the Hot 100 chart) doesn't actually use pagination, so the ?page=1 part of the URL is meaningless (it just keeps loading the same page). But for sites with real pagination, this approach would work:
library(tidyverse)
library(rvest)

hot100page <- "https://www.billboard.com/charts/hot-100"

rank <- c()

for (i in 1:32) {
  print(paste0("Scraping page ", i))
  # Read page i and pull out the rank numbers
  temp <- paste0(hot100page, '?page=', i) %>%
    read_html() %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>%
    rvest::html_text()
  rank <- c(rank, temp)
}

# Build the data frame once the vector is complete (assigning into a
# zero-row data.frame would fail with a length mismatch)
df <- data.frame(rank = rank)
df
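If you want this as a single function call rather than a loop, here is a minimal sketch of the same idea using purrr::map_dfr (purrr is loaded with the tidyverse; scrape_page is just an illustrative name, not part of rvest):

# Sketch: wrap the per-page scrape in a function
scrape_page <- function(url) {
  rank <- url %>%
    read_html() %>%
    xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>%
    rvest::html_text()
  data.frame(rank = rank)
}

# map_dfr row-binds the per-page data frames into one
df <- purrr::map_dfr(paste0(hot100page, '?page=', 1:32), scrape_page)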
Upvotes: 1
Reputation: 1336
I suggest you use RSelenium. Below is a possible solution.
# Load the library
library(RSelenium)

# Start a Selenium server and browser (you have to select one)
driver <- rsDriver(browser = c("firefox"), port = 4567L)

# Define the client part
remote_driver <- driver[["client"]]

# Send the website address to Firefox
remote_driver$navigate("https://www.amazon.com/s?k=statistics&ref=nb_sb_noss_2")

# An empty list to save the data
all_books <- list()

# A loop that clicks through the pages
for (i in 1:20) {
  # Sleep so the page has time to load
  Sys.sleep(3)
  # Find the body element via CSS
  scroll_d <- remote_driver$findElement(using = "css", value = "body")
  # Tell the browser to scroll to the end of the page
  scroll_d$sendKeysToElement(list(key = "end"))
  # Get all books, prices, rankings, etc.
  all_books[i] <- remote_driver$findElement(using = 'css selector', value = 'span.s-latency-cf-section:nth-child(4)')$getElementText()
  # Click the 'next' button
  next_button <- remote_driver$findElement(using = 'css selector', value = '.a-last')
  next_button$clickElement()
}
head(all_books)
[[1]]
[1] "1\nNew\nLife Goes On\nBTS\n-\n1\n1\n2\nFailing\nMood\n24kGoldn Featuring iann dior
Upvotes: 1