toastienf

Reputation: 57

Web scraping urls using a for loop

I am scraping tables from a website. So far I have scraped each page one at a time, but since the URLs follow a pattern I would like to run them through a for loop.

I am trying to use the following script:

for(i in 1:38) {
  webpage <- read_html(paste0("www.website.com/", i))
  data <- webpage %>%
    html_nodes("table") %>%
    .[[1]] %>% 
    html_table()
}

My main issue is that the URLs I am scraping do not fit the pattern in the for loop above, because the number is not the last part of the URL (it would be a lot easier without the trailing /W). They look like this: www.website.com/sample/test-01/W, www.website.com/sample/test-02/W, www.website.com/sample/test-03/W, etc.

I feel as though there is an extremely simple way to place these into the above for loop but I am not sure of the syntax.

EDIT: One more issue is the leading 0 in the URL www.website.com/sample/test-01/W. I can't simply paste i after a fixed 0, because the sequence goes 06, 07, 08, 09, 10, 11, so the 0 is only valid up to 09. The page www.website.com/sample/test-012/W does not exist.

Upvotes: 0

Views: 550

Answers (2)

Ronak Shah

Reputation: 389235

You can create a vector of URLs using sprintf, where %02d zero-pads the number to two digits:

web_urls <- sprintf('www.website.com/sample/test-%02d/W', 1:38)
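As a quick check of what that format string produces, here is a minimal sketch using the question's placeholder domain. It only builds the strings and does not fetch anything; the indices picked for inspection are arbitrary.

```r
# Build the 38 URLs from the question's pattern (placeholder domain).
web_urls <- sprintf("www.website.com/sample/test-%02d/W", 1:38)

# Single digits are zero-padded; two-digit numbers are left unchanged.
# Elements 1, 9, 10 and 38 end in test-01/W, test-09/W, test-10/W, test-38/W.
web_urls[c(1, 9, 10, 38)]
```

Because sprintf is vectorised over 1:38, no explicit loop is needed to build the URLs.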

Then use lapply to extract the table from each url.

library(rvest)

# Read one page and return its first table as a data frame.
extract_table <- function(url) {
  read_html(url) %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table()
}

# One list element per URL, in the same order as web_urls.
result <- lapply(web_urls, extract_table)
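With 38 requests, a single missing page or page without a table would stop the whole lapply call. One possible way around that, sketched here as an assumption rather than as part of the original answer, is to wrap the extraction in tryCatch so a failing URL yields NULL. The helper name safe_extract_table is hypothetical.

```r
library(rvest)

# Hypothetical helper: like extract_table, but returns NULL (with a message)
# instead of raising an error when a page cannot be read or has no table.
safe_extract_table <- function(url) {
  tryCatch({
    page <- read_html(url)
    html_table(html_nodes(page, "table")[[1]])
  }, error = function(e) {
    message("Skipping ", url, ": ", conditionMessage(e))
    NULL
  })
}
```

It can be passed to lapply in place of extract_table, and the failed pages can then be dropped with Filter(Negate(is.null), result).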

Upvotes: 1

Vishal A.

Reputation: 1391

To append the /W at the end, you just need to call the paste0 function once more on the URL.

for(i in 1:38) {
  webpage <- paste0("www.website.com/", i)
  temp <- paste0(webpage, "/W")
}

It will make the URL look like this:

www.website.com/1/W
www.website.com/2/W
...

To get the two-digit part, you can use sprintf from base R: calling sprintf("%02d", i) inside the loop pads single-digit numbers with a leading zero and leaves two-digit numbers unchanged.

The code will look like this:

for(i in 1:38) {
  webpage <- paste0("www.website.com/", sprintf("%02d", i))
  temp <- paste0(webpage, "/W")
  print(temp)
}

Note: I've modified the code to print each URL instead of scraping it, to demonstrate the point.

The output will look like this:

[1] "www.website.com/01/W"
[1] "www.website.com/02/W"
[1] "www.website.com/03/W"
[1] "www.website.com/04/W"
[1] "www.website.com/05/W"
[1] "www.website.com/06/W"
[1] "www.website.com/07/W"
[1] "www.website.com/08/W"
[1] "www.website.com/09/W"
[1] "www.website.com/10/W"
[1] "www.website.com/11/W"
[1] "www.website.com/12/W"
[1] "www.website.com/13/W"
[1] "www.website.com/14/W"
[1] "www.website.com/15/W"
[1] "www.website.com/16/W"
[1] "www.website.com/17/W"
[1] "www.website.com/18/W"
[1] "www.website.com/19/W"
[1] "www.website.com/20/W"
[1] "www.website.com/21/W"
[1] "www.website.com/22/W"
[1] "www.website.com/23/W"
[1] "www.website.com/24/W"
[1] "www.website.com/25/W"
[1] "www.website.com/26/W"
[1] "www.website.com/27/W"
[1] "www.website.com/28/W"
[1] "www.website.com/29/W"
[1] "www.website.com/30/W"
[1] "www.website.com/31/W"
[1] "www.website.com/32/W"
[1] "www.website.com/33/W"
[1] "www.website.com/34/W"
[1] "www.website.com/35/W"
[1] "www.website.com/36/W"
[1] "www.website.com/37/W"
[1] "www.website.com/38/W"

Upvotes: 1
