Reputation: 37
I need to collect the links from 3 pages (150 links in total) using R with the rvest library. I used a for-loop to crawl through the pages. I know it's a very basic question, which has been answered elsewhere: R web scraping across multiple pages, Scrape and Loop with Rvest. I tried different versions of the following code. Most of them worked, but returned only 50 instead of 150 links.
library(rvest)
baseurl <- "https://www.ebay.co.uk/sch/i.html?_from=R40&_nkw=chain+and+sprocket&_sacat=0&_pgn="
n <- 1:3
nextpages <- paste0(baseurl, n)
for (i in nextpages) {
  html <- read_html(nextpages)
  links <- html %>% html_nodes("a.vip") %>% html_attr("href")
}
The code is expected to return all 150 links, not just 50.
Upvotes: 0
Views: 419
Reputation: 1502
You're overwriting the links variable in every iteration, so you end up with only the last page's 50 links.
Also, you loop with the i variable, but read_html() is called on nextpages, which is a vector of 3 URLs; read_html() expects a single URL, so you should be getting an error there.
Try this:
links <- c()
for (i in nextpages) {
  html <- read_html(i)
  # Append this page's links to the running vector instead of overwriting it
  links <- c(links, html %>% html_nodes("a.vip") %>% html_attr("href"))
}
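The overwrite-versus-accumulate difference can be seen without any scraping. A minimal sketch with dummy per-page link vectors (the pages list here is hypothetical, standing in for the three scraped result sets):

```r
# Three hypothetical pages, each yielding a character vector of links
pages <- list(c("a1", "a2"), c("b1", "b2"), c("c1", "c2"))

# Overwriting inside the loop: only the last page's links survive
links <- c()
for (p in pages) links <- p
length(links)  # 2, not 6

# Accumulating: concatenate each page's links onto the result
links <- c()
for (p in pages) links <- c(links, p)
length(links)  # 6
```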
Upvotes: 1
Reputation: 388907
We can use map instead of a for loop.
library(rvest)
library(purrr)
# nextpages is the vector of page URLs built in the question
map(nextpages, . %>% read_html %>%
      html_nodes("a.vip") %>%
      html_attr("href")) %>%
  flatten_chr()
#[1] "https://www.ebay.co.uk/itm/Genuine-Honda-Chain-and-sprocket-set-Honda-Cub-C50-C70-C90-Heavy-Duty/254287014069?hash=item3b34afe8b5:g:wjEAAOSwqaBdH69W"
#[2] "https://www.ebay.co.uk/itm/DID-Heavy-Duty-Drive-Chain-And-JT-Sprocket-Kit-For-Honda-MSX125-Grom-2013-2019/223130604262?hash=item33f39ed2e6:g:QmwAAOSwdrpcAQ4c"
#.....
#...
Upvotes: 1