Reputation: 57
I am scraping tables from a website and have been scraping each web page one at a time, but since the URLs follow a pattern I am thinking of running them through a for loop.
I am trying to use the following script:
for (i in 1:38) {
  webpage <- read_html(paste0("www.website.com/", i))
  data <- webpage %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}
My main issue is that the URLs I am scraping do not follow a pattern I can put in the above for loop. Instead they read as follows (it would be a lot easier if the trailing /W weren't included): www.website.com/sample/test-01/W, www.website.com/sample/test-02/W, www.website.com/sample/test-03/W, etc.
I feel as though there is an extremely simple way to place these into the above for loop but I am not sure of the syntax.
EDIT: one more issue is the 0 in the URL www.website.com/sample/test-01/W. I can't just paste the i after the 0, since the pattern goes 06-07-08-09-10-11, with the 0 no longer valid after 09, and the page www.website.com/sample/test-012/W does not exist.
Upvotes: 0
Views: 550
Reputation: 389235
You may create a vector of URLs using sprintf:
web_urls <- sprintf('www.website.com/sample/test-%02d/W', 1:38)
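To see why the %02d format addresses the leading-zero problem from the question's EDIT, here is a small sketch comparing it with pasting a literal 0 in front of the number. The index values are arbitrary ones chosen for illustration.

```r
# A few page numbers on both sides of the 09 -> 10 boundary.
i <- c(1L, 9L, 10L, 38L)

# Naive approach: always prepend "0". Breaks from 10 onwards.
naive <- paste0("test-0", i)
print(naive)
# [1] "test-01"  "test-09"  "test-010" "test-038"

# "%02d" pads to a minimum width of 2 with zeros, so two-digit
# numbers are left unchanged.
padded <- sprintf("test-%02d", i)
print(padded)
# [1] "test-01" "test-09" "test-10" "test-38"
```

Since sprintf is vectorized, passing 1:38 gives all 38 strings in one call, with no loop needed.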
Then use lapply to extract the table from each URL.
library(rvest)
extract_table <- function(url) {
  webpage <- read_html(url)
  data <- webpage %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}
result <- lapply(web_urls, extract_table)
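With 38 pages, one missing page or one page without a table would stop the whole lapply call with an error. A minimal sketch of one way around that, assuming you would rather skip failures than abort: skip_on_error is a hypothetical helper (not part of rvest), and fake_extract is a stand-in for extract_table so the example runs offline.

```r
# Hypothetical helper: wraps a one-argument function so that an error
# for one URL yields NULL (with a message) instead of aborting lapply.
skip_on_error <- function(f) {
  function(url) {
    tryCatch(
      f(url),
      error = function(e) {
        message("Skipping ", url, ": ", conditionMessage(e))
        NULL
      }
    )
  }
}

# Stand-in for extract_table so the sketch runs without a network:
# it fails for one URL and returns a small data frame otherwise.
fake_extract <- function(url) {
  if (grepl("test-02", url)) stop("page not found")
  data.frame(url = url)
}

demo_urls <- sprintf("www.website.com/sample/test-%02d/W", 1:3)
result <- lapply(demo_urls, skip_on_error(fake_extract))
names(result) <- demo_urls

# Drop the pages that failed, keeping the URL names on the rest.
result <- Filter(Negate(is.null), result)
```

With the real function this would be lapply(web_urls, skip_on_error(extract_table)). Naming the list by URL makes it easy to see which pages actually produced a table.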
Upvotes: 1
Reputation: 1391
To append the /W at the end, you just need to call the paste0 function once more on webpage.
for (i in 1:38) {
  webpage <- paste0("www.website.com/", i)
  temp <- paste0(webpage, "/W")
}
This makes the URLs look like this:
www.website.com/1/W
www.website.com/2/W
...
To get the digits part, you can use sprintf from base R. To get two-digit numbers, use sprintf("%02d", i) inside the loop.
The code will look like this:
for (i in 1:38) {
  webpage <- paste0("www.website.com/", sprintf("%02d", i))
  temp <- paste0(webpage, "/W")
  print(temp)
}
Note: I've modified the code to print each URL instead of scraping it, just to show the result.
The output will look like this:
[1] "www.website.com/01/W"
[1] "www.website.com/02/W"
[1] "www.website.com/03/W"
[1] "www.website.com/04/W"
[1] "www.website.com/05/W"
[1] "www.website.com/06/W"
[1] "www.website.com/07/W"
[1] "www.website.com/08/W"
[1] "www.website.com/09/W"
[1] "www.website.com/10/W"
[1] "www.website.com/11/W"
[1] "www.website.com/12/W"
[1] "www.website.com/13/W"
[1] "www.website.com/14/W"
[1] "www.website.com/15/W"
[1] "www.website.com/16/W"
[1] "www.website.com/17/W"
[1] "www.website.com/18/W"
[1] "www.website.com/19/W"
[1] "www.website.com/20/W"
[1] "www.website.com/21/W"
[1] "www.website.com/22/W"
[1] "www.website.com/23/W"
[1] "www.website.com/24/W"
[1] "www.website.com/25/W"
[1] "www.website.com/26/W"
[1] "www.website.com/27/W"
[1] "www.website.com/28/W"
[1] "www.website.com/29/W"
[1] "www.website.com/30/W"
[1] "www.website.com/31/W"
[1] "www.website.com/32/W"
[1] "www.website.com/33/W"
[1] "www.website.com/34/W"
[1] "www.website.com/35/W"
[1] "www.website.com/36/W"
[1] "www.website.com/37/W"
[1] "www.website.com/38/W"
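One thing to watch if you stay with a for loop: the loop in the question assigns to data on every pass, so only the last page's table survives. A rough sketch of collecting everything instead, using the /sample/test-NN/W pattern from the question. The names urls, tables and n_pages are placeholders, and the scraping line is left as a comment so the sketch runs offline.

```r
n_pages <- 38

# Preallocate one slot per page instead of overwriting a single variable.
urls <- character(n_pages)
tables <- vector("list", n_pages)

for (i in seq_len(n_pages)) {
  urls[i] <- paste0("www.website.com/sample/test-", sprintf("%02d", i), "/W")
  # The scraping step would go here, storing one table per slot, e.g.:
  # tables[[i]] <- html_table(html_nodes(read_html(urls[i]), "table")[[1]])
}

# Spot-check the boundary between one- and two-digit page numbers.
print(urls[c(9, 10)])
# [1] "www.website.com/sample/test-09/W" "www.website.com/sample/test-10/W"
```

After the loop, tables[[i]] would hold the table scraped from urls[i], so nothing is lost between iterations.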
Upvotes: 1