mrsama

Reputation: 25

Web scrape multiple links with R

I'm trying to scrape some tennis stats with R from multiple links, using rvest and SelectorGadget. The page I scrape from is http://www.atpworldtour.com/en/scores/archive/stockholm/429/2017/results, and it has 29 links that look like this: "http://www.atpworldtour.com/en/scores/2017/429/MS001/match-stats". All the links have the same form, with only the match ID changing from MS001 to MS029. With the code below I get the desired result for only the first 9 links. I see the problem but don't know how to correct it: the first 9 IDs have two zeros after "MS" and the rest have one, so the 10th link should be MS010, but my code produces MS0010. Any help with this is much appreciated.

library(xml2)
library(rvest)
library(stringr)

round <- 1:29
urls <- paste0("http://www.atpworldtour.com/en/scores/2017/429/MS00", round,
               "/match-stats")

aces <- function(url) {
  url %>%
    read_html() %>%
    html_nodes(".percent-on:nth-child(3) .match-stats-number-left span") %>%
    html_text() %>%
    as.numeric()
}

results <- sapply(urls, aces)
results
$`http://www.atpworldtour.com/en/scores/2017/429/MS001/match-stats`
[1] 9

$`http://www.atpworldtour.com/en/scores/2017/429/MS002/match-stats`
[1] 8

$`http://www.atpworldtour.com/en/scores/2017/429/MS003/match-stats`
[1] 5

$`http://www.atpworldtour.com/en/scores/2017/429/MS004/match-stats`
[1] 4

$`http://www.atpworldtour.com/en/scores/2017/429/MS005/match-stats`
[1] 8

$`http://www.atpworldtour.com/en/scores/2017/429/MS006/match-stats`
[1] 9

$`http://www.atpworldtour.com/en/scores/2017/429/MS007/match-stats`
[1] 2

$`http://www.atpworldtour.com/en/scores/2017/429/MS008/match-stats`
[1] 9

$`http://www.atpworldtour.com/en/scores/2017/429/MS009/match-stats`
[1] 5

$`http://www.atpworldtour.com/en/scores/2017/429/MS0010/match-stats`
numeric(0)

Upvotes: 1

Views: 1022

Answers (1)

Len Greski

Reputation: 10875

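One can generate leading zeroes in a formatted string with the sprintf() function. The format specifier %03d pads an integer with zeros to a width of three digits, so 1 becomes 001 and 10 becomes 010, whereas pasting a fixed "MS00" prefix always adds two zeros regardless of how many digits follow.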

ids <- 1:29
urlList <- sapply(ids, function(x) {
  sprintf("%s%03d%s", "http://www.atpworldtour.com/en/scores/2017/429/MS",
          x, "/match-stats")
})
# print a few items
urlList[c(1, 9, 10, 29)]

...and the output:

> urlList[c(1,9,10,29)]
[1] "http://www.atpworldtour.com/en/scores/2017/429/MS001/match-stats"
[2] "http://www.atpworldtour.com/en/scores/2017/429/MS009/match-stats"
[3] "http://www.atpworldtour.com/en/scores/2017/429/MS010/match-stats"
[4] "http://www.atpworldtour.com/en/scores/2017/429/MS029/match-stats"
> 
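
As a possible simplification, sprintf() is vectorised over its arguments, so the sapply() wrapper is not strictly needed. The sketch below builds the same 29 URLs in one call and, for comparison, with base R's formatC(). The variable names urls and urls_alt are only illustrative and do not come from the question.

```r
# Match numbers 1 to 29, as in the question.
ids <- 1:29

# sprintf() recycles the format string over the whole vector;
# %03d left-pads each integer with zeros to three digits.
urls <- sprintf(
  "http://www.atpworldtour.com/en/scores/2017/429/MS%03d/match-stats",
  ids
)

# Alternative (illustrative): formatC() does the zero padding,
# and paste0() assembles the pieces.
urls_alt <- paste0(
  "http://www.atpworldtour.com/en/scores/2017/429/MS",
  formatC(ids, width = 3, flag = "0"),
  "/match-stats"
)

# Check that both approaches agree, then inspect a few items.
identical(urls, urls_alt)
urls[c(1, 9, 10, 29)]
```

Either vector can then be passed to the aces() function from the question, for example with sapply(urls, aces).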

Upvotes: 1
