ASH

Reputation: 20342

Run through a few URLs, and import data from each

I am trying to figure out how to loop through a few URLs. This is just a learning exercise for myself. I thought I basically knew how to do this, but I have become stuck. I believe my code below is close, but for some reason it isn't incrementing and isn't scraping anything.

library(rvest)
URL <- "https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&rt=nc"
WS <- read_html(URL)
URLs <- WS %>% html_nodes("ResultSetItems") %>% html_attr("href") %>% as.character()

Basically, I went to eBay, entered a simple search term, found a key node named 'ResultSetItems', and tried to scrape the items from it. Nothing happened. I'm also trying to figure out how to increment through, let's say, 5 URLs and apply the same logic to each. The URLs would look like this:

'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=1&_skc=0&rt=nc'              
    
'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=2&_skc=0&rt=nc'

'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=3&_skc=0&rt=nc'

'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=4&_skc=0&rt=nc'

'https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=5&_skc=0&rt=nc'

I think the code should look something like this:

for(i in 1:5) 
{
  
   site <- paste("https://www.ebay.com/sch/i.html?_from=R40&_sacat=0&_nkw=mens%27s+shoes+size+11&_pgn=",i,"&_skc=0&rt=nc", jump, sep="")
   dfList <- lapply(site, function(i) {
       WS <- read_html(i)
       URLs <- WS %>% html_nodes("ResultSetItems") %>% html_attr("href") %>% as.character()
   })
}
finaldf <- do.call(rbind, webpage)

I can’t seem to get this working. I may be over-simplifying things.

Upvotes: 0

Views: 106

Answers (1)

Tito Sanz

Reputation: 1362

Here is how to do it. Given a set of URLs (read_urls in my case), you only need to map the scraping functions over them with purrr's map():

library(rvest)
library(purrr)   # map() comes from purrr (also loaded with the tidyverse)

read_urls %>% 
  map(~read_html(.)) %>% 
  map(html_nodes, css = "xxxx") %>% 
  map(html_nodes, xpath = "xxx") %>% 
  map(html_attr, name = "xxx") %>%
  unlist()

You will get a list of objects to which you can apply the same functions to extract the data you want. Once you have done that, you just have to put the list together into a data frame, as sketched below.
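For example, that last step might look something like this, with made-up page results (per_page and url are just placeholder names for illustration):

library(tidyverse)

# Made-up per-page results: one character vector of scraped links per page
per_page <- list(
  c("https://example.com/item/1", "https://example.com/item/2"),
  c("https://example.com/item/3")
)

# Turn each vector into a one-column tibble and row-bind them together
all_items <- map_dfr(per_page, ~ tibble(url = .x))
all_items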

But looking at http://www.ebay.com/robots.txt, it seems that scraping is not allowed on that part of eBay's site. Maybe you should choose another example to practice on. ;) HTH!
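As an aside, one way to check this from R is with the robotstxt package (just a possible helper, not required for the rest of the answer):

# install.packages("robotstxt") if needed
library(robotstxt)

# Returns TRUE/FALSE depending on whether the given path may be crawled
paths_allowed("https://www.ebay.com/sch/i.html")
paths_allowed("http://books.toscrape.com/catalogue/page-1.html")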

Edit:

Your eBay example cannot give results because scraping it is prohibited. To make this clearer, I will use the example of a web page that does allow web scraping (books.toscrape.com). This is how I do it, avoiding functions of the apply family. First, we generate the list of pages from which to obtain the information.

library(rvest)
library(tidyverse)

urls <- "http://books.toscrape.com/catalogue/page-"

pag <- 1:5

read_urls <- paste0(urls, pag, ".html")

read_urls %>% 
  map(read_html) -> p

Once the pages have been read, simply extract the information you want: use html_nodes to drill down to the elements that contain it (using CSS selectors or XPath where necessary), then html_attr for the title attribute in the first case, or simply html_text for the price. Finally, convert the results to a tibble:

#Extract titles from the pages
p %>%  
  map(html_nodes, "article") %>% 
  map(html_nodes, xpath = "./h3/a") %>% 
  map(html_attr, "title") %>% 
  unlist() -> titles

#Extract price from the pages
p %>% 
  map(html_nodes, "article") %>% 
  map(html_nodes, ".price_color") %>% 
  map(html_text) %>% 
  unlist() -> prices

r <- tibble(titles, prices)

As result:

# A tibble: 100 x 2
                                 titles prices
                                  <chr>  <chr>
1                  A Light in the Attic £51.77
2                    Tipping the Velvet £53.74
3                            Soumission £50.10
4                         Sharp Objects £47.82
5 Sapiens: A Brief History of Humankind £54.23
6                       The Requiem Red £22.65

Now it is possible to turn all of this into a function, but I leave that in your hands.
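For what it's worth, a sketch of such a function might look like this, reusing the pipeline above (scrape_books and n_pages are just illustrative names):

library(rvest)
library(tidyverse)

# Scrape titles and prices from the first n_pages catalogue pages
scrape_books <- function(n_pages = 5) {
  pages <- paste0("http://books.toscrape.com/catalogue/page-", seq_len(n_pages), ".html") %>% 
    map(read_html)

  titles <- pages %>% 
    map(html_nodes, "article") %>% 
    map(html_nodes, xpath = "./h3/a") %>% 
    map(html_attr, "title") %>% 
    unlist()

  prices <- pages %>% 
    map(html_nodes, "article") %>% 
    map(html_nodes, ".price_color") %>% 
    map(html_text) %>% 
    unlist()

  tibble(titles, prices)
}

scrape_books(2)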

Upvotes: 1
