Reputation: 20342
I am trying to figure out how to loop through a few URLs. This is just a learning exercise for myself. I thought I basically knew how to do this, but I have become stuck. I believe my code below is close, but for some reason it isn't incrementing or scraping anything.
library(rvest)
URL <- "https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&rt=nc"
WS <- read_html(URL)
URLs <- WS %>% html_nodes("#ResultSetItems") %>% html_attr("href") %>% as.character()
Basically, I went to eBay, entered a simple search term, found a key node named 'ResultSetItems', and tried to scrape the items from it. Nothing happened. I'm also trying to figure out how to increment through, say, 5 URLs and apply the same logic. The URLs would look like this:
'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=1&_skc=0&rt=nc'
'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=2&_skc=0&rt=nc'
'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=3&_skc=0&rt=nc'
'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=4&_skc=0&rt=nc'
'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=5&_skc=0&rt=nc'
I think the code should look something like this:
# Build the five page URLs in one vectorised paste0() call
sites <- paste0("https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=", 1:5, "&_skc=0&rt=nc")
# Read each page and pull the href attributes from the results node
dfList <- lapply(sites, function(site) {
  WS <- read_html(site)
  WS %>% html_nodes("#ResultSetItems") %>% html_attr("href") %>% as.character()
})
finaldf <- do.call(rbind, dfList)
I can’t seem to get this working. I may be over-simplifying things.
Upvotes: 0
Views: 106
Reputation: 1362
Here is how to do it. Given a set of URLs (read_url in my case), you only need to apply the scraping steps with purrr's map function.
library(rvest)
library(purrr)  # map() comes from purrr, not rvest

read_url %>%
  map(read_html) %>%
  map(html_nodes, css = "xxxx") %>%
  map(html_nodes, xpath = "xxx") %>%
  map(html_attr, name = "xxx") %>%
  unlist()
You will get a list of objects to which you can apply the same functions to get the data you want. Once you have done that, you just have to combine the list into a data frame.
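For instance, a minimal sketch of that last step (the names pages and result are my own; I assume each list element is a character vector extracted from one page):
library(tibble)

# Dummy stand-in for the per-page extraction results
pages <- list(c("/item/1", "/item/2"), c("/item/3"))

# Flatten the list and wrap it in a one-column data frame
result <- tibble(link = unlist(pages))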
But looking at http://www.ebay.com/robots.txt, it seems that scraping is not allowed on this part of eBay. Maybe you should choose another example to practice on. ;) HTH!
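If you want to check that programmatically, one option is the robotstxt package (an extra dependency, not part of rvest; a minimal sketch):
library(robotstxt)

# FALSE means the path is disallowed for the default user agent
paths_allowed(paths = "/sch/i.html", domain = "ebay.com")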
Edited
Your eBay example cannot return results because scraping it is prohibited. To make this clearer, I will use the example of a web page that allows web scraping, books.toscrape.com. This is how I do it to avoid using functions of the apply family. First, we generate the list of pages from which to obtain the information.
library(rvest)
library(tidyverse)

# Build the URLs for the first five catalogue pages
urls <- "http://books.toscrape.com/catalogue/page-"
pag <- 1:5
read_urls <- paste0(urls, pag, ".html")

# Read each page once and keep the parsed documents in a list
read_urls %>%
  map(read_html) -> p
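As a side note, if you scrape more pages it can be polite to pause between requests. A small variation of the step above (the one-second delay is my own choice, not a requirement of the site):
read_urls %>%
  map(function(u) {
    Sys.sleep(1)  # wait a second between requests
    read_html(u)
  }) -> p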
Once the pages are read, simply extract the information you want: use html_nodes to drill down through the document (with CSS selectors or XPath where necessary), then read an attribute with html_attr (as in the title case) or the text with html_text (as in the price case). Finally, convert the result to a tibble:
# Extract titles from the pages
p %>%
  map(html_nodes, "article") %>%
  map(html_nodes, xpath = "./h3/a") %>%
  map(html_attr, "title") %>%
  unlist() -> titles

# Extract prices from the pages
p %>%
  map(html_nodes, "article") %>%
  map(html_nodes, ".price_color") %>%
  map(html_text) %>%
  unlist() -> prices

r <- tibble(titles, prices)
As a result:
# A tibble: 100 x 2
titles prices
<chr> <chr>
1 A Light in the Attic £51.77
2 Tipping the Velvet £53.74
3 Soumission £50.10
4 Sharp Objects £47.82
5 Sapiens: A Brief History of Humankind £54.23
6 The Requiem Red £22.65
Now it is possible to turn all of this into a function, but I leave that in your hands.
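A minimal sketch of what such a function could look like (the name scrape_books and its n_pages argument are my own, assuming the same books.toscrape.com structure as above):
library(rvest)
library(tidyverse)

# Scrape titles and prices from the first n_pages catalogue pages
scrape_books <- function(n_pages = 5) {
  read_urls <- paste0("http://books.toscrape.com/catalogue/page-",
                      seq_len(n_pages), ".html")
  p <- map(read_urls, read_html)
  titles <- p %>%
    map(html_nodes, "article") %>%
    map(html_nodes, xpath = "./h3/a") %>%
    map(html_attr, "title") %>%
    unlist()
  prices <- p %>%
    map(html_nodes, "article") %>%
    map(html_nodes, ".price_color") %>%
    map(html_text) %>%
    unlist()
  tibble(titles, prices)
}

# Usage: scrape_books(2) returns one row per book on the first two pages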
Upvotes: 1