Reputation: 395
I want to extract the link and automatically download the file for the first record with Type = 'AA'.
I managed to extract the table, but how do I extract the link in the last column for rows of type 'AA'?
library(rvest)
library(stringr)

url <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"
wahis.session <- html_session(url)

# Extract the filing-history table by its id
r <- wahis.session %>%
  html_nodes(xpath = '//*[@id="fhTable"]') %>%
  html_table(fill = TRUE)
Upvotes: 0
Views: 2228
Reputation:
I would extract the tr nodes, use purrr's map to generate data frames of each row's .filing-type text and .download href attribute, stack the data frames with dplyr's bind_rows, and finally filter on type == "AA":
library(dplyr)
library(rvest)
library(purrr)

url <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"
html <- read_html(url)

html %>%
  html_nodes("tr") %>%
  # For each table row, pull the filing-type text and the download href;
  # rows lacking those nodes yield NA and are dropped by the filter below
  map(~ tibble(type = html_text(html_node(., ".filing-type"), trim = TRUE),
               href = html_attr(html_node(., ".download"), "href"))) %>%
  bind_rows() %>%
  filter(type == "AA")
This returns a dataframe of paths for type "AA" documents:
type href
<chr> <chr>
1 AA /company/02280000/filing-history/MzIxMjY0MDgxOGFkaXF6a2N4/document?format=pdf&download=0
2 AA /company/02280000/filing-history/MzE4NDAwMDg1NGFkaXF6a2N4/document?format=pdf&download=0
Now you just need to paste together the domain and the paths, and then use either base R's download.file or httr's GET with write_disk to download the files.
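For example, here is a minimal sketch of that last step, assuming you assigned the filtered dataframe above to a variable res (the report file names are placeholders of my own):
library(httr)

base_url <- "https://beta.companieshouse.gov.uk"

# res is the filtered dataframe from the pipeline above (assumption:
# res <- html %>% ... %>% filter(type == "AA"))
urls <- paste0(base_url, res$href)

for (i in seq_along(urls)) {
  # write_disk() streams the response body straight to a file on disk
  GET(urls[i], write_disk(paste0("report", i, ".pdf"), overwrite = TRUE))
}
Base R's download.file(urls[i], ...) works just as well here, since these are plain GET downloads.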
Upvotes: 0
Reputation: 171
I'm assuming this website is OK with you automatically crawling through it; if you're not sure, check its robots.txt and the site's policy on crawling.
You actually have a lot of work ahead of you.
The script below should help you extract the desired reports from a single page. If you want to extract them from all pages, see the pagination sketch after the script, and consider working through a web-scraping tutorial such as this one: https://www.datacamp.com/community/tutorials/r-web-scraping-rvest.
Another package you could check out is Rcrawler, which automates a lot of the extraction part of the script but requires you to learn its functions.
library(rvest)
library(dplyr)

url <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"
url2 <- "https://beta.companieshouse.gov.uk"

wahis.session <- html_session(url)

# Extract the filing-history table by its id
r <- wahis.session %>%
  html_nodes(xpath = '//*[@id="fhTable"]') %>%
  html_table(fill = TRUE)

# Extract the href of every node with class "download"
s <- wahis.session %>%
  html_nodes(xpath = '//*[contains(concat(" ", @class, " "), " download ")]') %>%
  html_attr("href")

# Attach the absolute links to the table, then keep only the AA filings.
# Note: this assumes every table row has exactly one download link.
r <- r[[1]] %>%
  as_tibble %>%
  mutate(link = paste0(url2, s)) %>%
  filter(Type == "AA")

# Download each report as report1.pdf, report2.pdf, ...
n <- paste0("report", seq_along(r$link), ".pdf")
for (i in seq_along(n)) {
  download.file(r$link[i], n[i], mode = "wb")
}
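And if you later want every page rather than just the first, here is a minimal pagination sketch. It assumes the filing history accepts a page query parameter (e.g. ?page=2); check the pager links at the bottom of the page to confirm, and note the page count of 5 is a placeholder:
library(rvest)

base <- "https://beta.companieshouse.gov.uk/company/02280000/filing-history"

# Placeholder: read the real number of pages from the site's pager
pages <- paste0(base, "?page=", 1:5)

# Collect the download hrefs from every page
all_hrefs <- unlist(lapply(pages, function(p) {
  read_html(p) %>%
    html_nodes(".download") %>%
    html_attr("href")
}))
From there, the paste0() and download.file() steps are the same as in the script above.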
Upvotes: 5