Reputation: 171
I have a list of base URLs as follows:
PostURL
www.abc.com/2315Azxc
www.abc.com/1478Bnbx
www.abc.com/6734Gytr
www.abc.com/8912Jqwe
Each URL has sub-pages like:
www.abc.com/2315Azxc&page=1
www.abc.com/2315Azxc&page=2
www.abc.com/2315Azxc&page=3
I know how to scrape data from multiple sub-pages of one base URL using rvest, as follows:
library(rvest)

df <- lapply(paste0('www.abc.com/2315Azxc&page=', 1:3),
             function(url) {
               # read each page and extract the text of the ".xg_border" nodes
               url %>%
                 read_html() %>%
                 html_nodes(".xg_border") %>%
                 html_text()
             })
But it would take a lot of time and attention to scrape each base URL one by one. I am looking for a solution that can scrape data from multiple sub-pages of all the base URLs.
Upvotes: 1
Views: 246
Reputation: 388907
You could construct links to all the URLs using outer:
all_links <- c(t(outer(df$PostURL, paste0('&page=', 1:3), paste0)))
all_links
# [1] "www.abc.com/2315Azxc&page=1" "www.abc.com/2315Azxc&page=2" "www.abc.com/2315Azxc&page=3"
# [4] "www.abc.com/1478Bnbx&page=1" "www.abc.com/1478Bnbx&page=2" "www.abc.com/1478Bnbx&page=3"
# [7] "www.abc.com/6734Gytr&page=1" "www.abc.com/6734Gytr&page=2" "www.abc.com/6734Gytr&page=3"
#[10] "www.abc.com/8912Jqwe&page=1" "www.abc.com/8912Jqwe&page=2" "www.abc.com/8912Jqwe&page=3"
Now you can use the same lapply code to scrape each page.
data
df <- structure(list(PostURL = c("www.abc.com/2315Azxc", "www.abc.com/1478Bnbx",
                                 "www.abc.com/6734Gytr", "www.abc.com/8912Jqwe")),
                class = "data.frame", row.names = c(NA, -4L))
Upvotes: 1