Japes

Reputation: 209

R - Scrape a number of URLs and save individually

Disclaimer: I'm not a programmer by trade and my knowledge of R is limited to say the least. I've also already searched Stackoverflow for a solution (but to no avail).

Here's my situation: I need to scrape a series of webpages and save the data (not quite sure in what format, but I'll get to that). Fortunately the pages I need to scrape have a very logical naming structure (they use the date).

The base URL is: https://www.bbc.co.uk/schedules/p00fzl6p

I need to scrape everything from August 1st 2018 (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2018/08/01) until yesterday (for which the URL is https://www.bbc.co.uk/schedules/p00fzl6p/2020/05/17).

So far I've figured out how to create a list of dates which can be appended to the base URL using the following:

dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates, "%Y/%m/%d")

I can append these to the base URL with the following:

url <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/",dates)

However, that's pretty much as far as I've gotten (not very far, I know!). I assume I need to use a for loop, but my own attempts at this have proved futile. Perhaps I'm not approaching this the right way?

In case it's not clear, what I'm trying to do is to visit each URL and save the html as an individual html file (ideally labelled with the relevant date). In truth, I don't need all of the html (just the list of programmes and times) but I can extract that information from the relevant files at a later date.

Any guidance on the best way to approach this would be much appreciated! And if you need any more info, just ask.

Upvotes: 0

Views: 228

Answers (1)

user12728748

Reputation: 8506

Have a look at the rvest package and the associated tutorials, e.g. https://www.datacamp.com/community/tutorials/r-web-scraping-rvest. The messy part is extracting the fields in the form you want them.

Here is one possible solution:

library(rvest)
#> Loading required package: xml2
library(magrittr)
library(stringr)
library(data.table)
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by=1)
dates <- format(dates, "%Y/%m/%d")
urls <- paste0("https://www.bbc.co.uk/schedules/p00fzl6p/", dates)

get_data <- function(url){
    # Return NULL on a failed request so one bad day doesn't abort the whole run
    html <- tryCatch(read_html(url), error = function(e) NULL)
    if (is.null(html)) return(data.table(
        # Emit a placeholder row so failed dates remain visible in the result
        date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
        title = NA, description = NA)) else {
            # Broadcast start times, e.g. "06:00"
            time <- html %>%
                rvest::html_nodes('body') %>%
                xml2::xml_find_all("//div[contains(@class, 'broadcast__info grid 1/4 1/6@bpb2 1/6@bpw')]") %>%
                rvest::html_text() %>% gsub(".*([0-9]{2}:[0-9]{2}).*", "\\1", .)
            # Programme title and description: collapse whitespace, fold the
            # "(R)" repeat marker onto the preceding line, then split each
            # entry on the first remaining newline
            text <- html %>%
                rvest::html_nodes('body') %>%
                xml2::xml_find_all("//div[contains(@class, 'programme__body')]") %>%
                rvest::html_text() %>%
                gsub("[ ]{2,}", " ", .) %>% gsub("[\n ]{2,}", "\n", .) %>%
                gsub("\n(R)\n", " (R)", ., fixed = TRUE) %>%
                gsub("^\n|\n$", "", .) %>%
                str_split_fixed(., "\n", 2) %>%
                as.data.table() %>% setnames(., c("title", "description")) %>%
                .[, `:=`(date = gsub("https://www.bbc.co.uk/schedules/p00fzl6p/", "", url),
                         time = time,
                         description = gsub("\n", " ", description))] %>%
                setcolorder(., c("date", "time", "title", "description"))
            text
        }
}
# mclapply forks, which is not available on Windows; use lapply() there instead
res <- rbindlist(parallel::mclapply(urls, get_data, mc.cores = 6L))
res
#>              date  time
#>     1: 2018/08/01 06:00
#>     2: 2018/08/01 09:15
#>     3: 2018/08/01 10:00
#>     4: 2018/08/01 11:00
#>     5: 2018/08/01 11:45
#>    ---                 
#> 16760: 2020/05/17 22:20
#> 16761: 2020/05/17 22:30
#> 16762: 2020/05/17 00:20
#> 16763: 2020/05/17 01:20
#> 16764: 2020/05/17 01:25
#>                                                                       title
#>     1:                                                 Breakfast—01/08/2018
#>     2:                           Wanted Down Under—Series 11, Hanson Family
#>     3:                          Homes Under the Hammer—Series 21, Episode 6
#>     4:                                     Fake Britain—Series 7, Episode 7
#>     5: The Farmers' Country Showdown—Series 2 30-Minute Versions, Ploughing
#>    ---                                                                     
#> 16760:                                     BBC London—Late News, 17/05/2020
#> 16761:                                                       Educating Rita
#> 16762:                          The Real Marigold Hotel—Series 4, Episode 2
#> 16763:                                Weather for the Week Ahead—18/05/2020
#> 16764:                                            Joins BBC News—18/05/2020
#>                                                                                       description
#>     1:                The latest news, sport, business and weather from the BBC's Breakfast team.
#>     2: 22/24 Will a week in Melbourne help Keith persuade his wife Mary to move to Australia? (R)
#>     3:               Properties in Hertfordshire, Croydon and Derbyshire are sold at auction. (R)
#>     4:                       7/10 The fake sports memorabilia that cost collectors thousands. (R)
#>     5: 13/20 Farmers show the skill and passion needed to do well in a top ploughing competition.
#>    ---                                                                                           
#> 16760:                                            The latest news, sport and weather from London.
#> 16761:  Comedy drama about a hairdresser who dreams of rising above her drab urban existence. (R)
#> 16762:   2/4 The group take a night train to Madurai to attend the famous Chithirai festival. (R)
#> 16763:                                                                 Detailed weather forecast.
#> 16764:                          BBC One joins the BBC's rolling news channel for a night of news.

Created on 2020-05-18 by the reprex package (v0.3.0)
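If you also want the raw HTML of each page on disk (as the question originally describes), a minimal sketch using base R's download.file would do it. The schedules/ directory name, the date-based file names, and the one-second delay between requests are my own choices here, not anything the site requires:

```r
base_url <- "https://www.bbc.co.uk/schedules/p00fzl6p/"
dates <- seq(as.Date("2018-08-01"), as.Date("2020-05-17"), by = 1)
date_paths <- format(dates, "%Y/%m/%d")                # e.g. "2018/08/01"
urls <- paste0(base_url, date_paths)
files <- paste0(gsub("/", "-", date_paths), ".html")   # e.g. "2018-08-01.html"

dir.create("schedules", showWarnings = FALSE)
for (i in seq_along(urls)) {
    # tryCatch so a single failed request doesn't stop the loop
    tryCatch(
        download.file(urls[i], file.path("schedules", files[i]), quiet = TRUE),
        error = function(e) message("Failed: ", urls[i])
    )
    Sys.sleep(1)  # be polite to the server between requests
}
```

You can then parse the saved files later with read_html() on each local path instead of the live URL.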

Upvotes: 1
