Jane

Reputation: 395

How do I extract the download link and download the file in R?

I want to extract the link and download the file automatically for the first record with Type='AA'.

I managed to extract the table, but how do I extract the link in the last column for type 'AA'?


library(rvest)
library(stringr)

url <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"
wahis.session <- html_session(url)
r <- wahis.session %>%
  html_nodes(xpath = '//*[@id="fhTable"]') %>%
  html_table(fill = TRUE)

Upvotes: 0

Views: 2228

Answers (2)

user10191355

Reputation:

I would extract the tr nodes, use purrr's map to build a one-row data frame per node from the .filing-type text and the .download href attribute, stack the data frames with dplyr's bind_rows, and finally filter on type == "AA":

library(dplyr)
library(rvest)
library(purrr)

url <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"

html <- read_html(url)

html %>%
    html_nodes("tr") %>% 
    map(~ tibble(type = html_text(html_node(., ".filing-type"), trim = TRUE),
                 href = html_attr(html_node(., ".download"), "href"))) %>% 
    bind_rows() %>% 
    filter(type == "AA")

This returns a dataframe of paths for type "AA" documents:

  type  href                                                                                    
  <chr> <chr>                                                                                   
1 AA    /company/02280000/filing-history/MzIxMjY0MDgxOGFkaXF6a2N4/document?format=pdf&download=0
2 AA    /company/02280000/filing-history/MzE4NDAwMDg1NGFkaXF6a2N4/document?format=pdf&download=0

Now you just need to paste together the domain and the paths, and then use either base R's download.file or httr's GET with write_disk to download the files.
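
For example, a minimal sketch of that last step, assuming the result of the pipeline above has been saved as df; the base URL and the report1.pdf, report2.pdf, ... file names are placeholders, not part of the answer above:

# Assumes the data frame from the pipeline above is stored in `df`
base_url <- "https://beta.companieshouse.gov.uk"

# Prepend the domain to each relative path
full_urls <- paste0(base_url, df$href)

# Placeholder file names; adjust the naming scheme as needed
dest <- paste0("report", seq_along(full_urls), ".pdf")

for (i in seq_along(full_urls)) {
  download.file(full_urls[i], dest[i], mode = "wb")  # "wb" keeps the PDFs intact on Windows
}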

Upvotes: 0

MAIAkoVSky

Reputation: 171

I'm assuming this website is OK with you automatically crawling through it; if you're not sure, check its robots.txt and the site's policy on crawling.

You actually have a lot of work ahead of you.

  1. How to extract only specific nodes rather than all of them.
  2. How to extract links instead of the overlaid text string.
  3. How to download multiple files at once and name them.
  4. How to move to the next page and repeat the process (see the pagination sketch after the script below).

This script should help you extract the desired reports from a single page. If you want to make a script that extracts them from all pages, I recommend checking out a tutorial on web scraping, such as this one: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest.

Another package you could check out is Rcrawler, which automates a lot of the extraction part of the script but requires you to learn its functions.

library(rvest)
library(dplyr)   # needed for as_tibble(), mutate(), and filter()

url <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"
url2 <- "https://beta.companieshouse.gov.uk"

wahis.session <- html_session(url)

# Extract the filing-history table
r <- wahis.session %>%
  html_nodes(xpath = '//*[@id="fhTable"]') %>%
  html_table(fill = TRUE)

# Extract the href of every element with the "download" class
s <- wahis.session %>% 
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), " download ")]') %>% 
  html_attr("href")

# Attach the full link to each row and keep only type "AA"
r <- r[[1]] %>% 
  as_tibble() %>% 
  mutate(link = paste0(url2, s)) %>% 
  filter(Type == "AA")

# Name the files report1.pdf, report2.pdf, ... and download them
n <- paste0("report", seq_along(r$link), ".pdf")

for (i in seq_along(n)) {
  download.file(r$link[i], n[i], mode = "wb")
}
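
For item 4 (moving to the next page), here is a minimal sketch; the ?page= query parameter and the total of 5 pages are assumptions you should verify against the site's own pagination links:

all_links <- character(0)

for (p in 1:5) {                     # number of pages is just an example
  page <- html_session(paste0(url, "?page=", p))

  links <- page %>%
    html_nodes(".download") %>%      # same "download" class used above
    html_attr("href")

  all_links <- c(all_links, paste0(url2, links))
}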

Upvotes: 5
