I am working with all of the 2012 data from the following website: https://councildocs.dsm.city/resolutions/
The data are separated by meeting date, and clicking on a date links to a different page with the linked pdfs. There are 34 links for 2012, with multiple pdfs attached to each meeting date page. The date format is yyyymmdd for each meeting file.
[main page](https://i.sstatic.net/M1hKozpB.png)
[sub page](https://i.sstatic.net/A2wAnJH8.png)
[page with pdf](https://i.sstatic.net/LhRGFfrd.png)
My goal is to automate crawling and scraping all of the pdfs for 2012 (hopefully in one fell swoop) and to read them using R. (I do not want to download any of these files.) I have figured out how to scrape pdfs for individual meetings, but have no idea how to automate the task for all of the meetings.
Any help would be much appreciated! I have included example code for the first meeting's pdfs.
```{r setup}
library(tidyverse)
library(rvest)
library(pdftools)
```
```{r des moines}
# store base website
des_moines_2012 <- "https://councildocs.dsm.city/resolutions/"
# read the site
des_moines_2012_elements <- read_html(des_moines_2012) %>%
  # extract elements
  html_elements("a")
# reduce data to year 2012
des_moines_2012_elements_sub <-
  des_moines_2012_elements[str_detect(html_attrs(des_moines_2012_elements), "2012")]
# extract attributes and remove "/resolutions/" for joining parts of link
des_moines_2012_links <- des_moines_2012_elements_sub %>%
  html_attr("href") %>%
  str_remove_all("/resolutions/")
# create tibble with main link and sub link to create full link
des_moines_2012_full_links <- tibble(main_link = rep(des_moines_2012, 34),
                                     sub_link = des_moines_2012_links,
                                     full_link = str_c(main_link, sub_link))
```
```{r dm 20120109}
# read html for first 2012 link and pull the pdf file names
pdfs_dm_2012_1 <- tibble(
  pdf_name = read_html(des_moines_2012_full_links$full_link[1]) %>%
    # extract elements
    html_elements("a") %>%
    # extract attributes
    html_attrs() %>%
    # extract pdf link (text after the date directory)
    str_extract("(?<=[:digit:]/).*")
)
# reform complete links
pdfs_2012_1_full <- pdfs_dm_2012_1 %>%
  mutate(main_link = des_moines_2012_full_links$full_link[1],
         full_link = str_c(main_link, pdf_name)) %>%
  # omit NAs to get rid of first descriptive entry
  na.omit()
# get pdf text for all pdfs in first html file
pert_pdfs_2012_1_dm <- sapply(pdfs_2012_1_full$full_link, function(x) pdf_text(x)) %>%
  # reduce data to documents containing the following strings
  str_extract("FEDERAL|(F|f)ederal|^US$|^us$|UNITED STATES|(U|u)nited (S|s)tates|(P|p)resident|PRESIDENT|OPPOS(E|ITION)|(O|o)ppos(e|ition)|(S|s)upport|(N|n)ational|NATIONAL") %>%
  # make tibble
  tibble()
pert_pdfs_2012_1_sap <- pert_pdfs_2012_1_dm %>%
  # add full link to tibble
  mutate(complete_link = pdfs_2012_1_full$full_link) %>%
  # omit NAs from str_extract
  na.omit()
# read pdf text from pertinent pdfs
sapply(pert_pdfs_2012_1_sap$complete_link, function(x) pdf_text(x))
```
I used the following code as a solution to my original issue.
```r
library(tidyverse)
library(rvest)
library(pdftools)
# store base website
des_moines_2012 <- "https://councildocs.dsm.city/resolutions/"
# read the site
des_moines_2012_elements <- read_html(des_moines_2012) %>%
  # extract elements
  html_elements("a")
# reduce data to year 2012
des_moines_2012_elements_sub <-
  des_moines_2012_elements[str_detect(html_attrs(des_moines_2012_elements), "/2012")]
# extract attributes and remove "/resolutions/" for joining parts of link
des_moines_2012_links <- des_moines_2012_elements_sub %>%
  html_attr("href") %>%
  str_remove_all("/resolutions/")
# create tibble with main link and sub link to create full link
des_moines_2012_full_links <- tibble(main_link = rep(des_moines_2012, 30),
                                     sub_link = des_moines_2012_links,
                                     full_link = str_c(main_link, sub_link))
# create list to store sites
output <- list()
for (i in des_moines_2012_full_links$full_link) {
  # pdf file names for this meeting (non-pdf links become NA and are dropped)
  output[[i]] <- read_html(i) %>%
    html_elements("a") %>%
    html_attrs() %>%
    str_extract("(?<=[:digit:]{8}/)(.*).pdf") %>%
    na.omit()
  # prepend the meeting url to build full pdf links
  output[[i]] <- paste0(i, output[[i]])
}
# read pdfs by meeting date
sapply(output$`https://councildocs.dsm.city/resolutions/20120109/`,
       pdf_text)
```
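To read every meeting at once rather than a single date, the same sapply() call could be mapped over the whole list; a minimal sketch assuming output has been built as above (note this fetches every pdf, so it is slow):

```r
# read pdf text for every meeting (one list element per meeting date)
all_pdf_text <- lapply(output, function(links) sapply(links, pdf_text))
```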
To extend your solution to cover more than one meeting, you could just collect the links from all meetings first. A somewhat modified approach is shown in the example below, where I first get a vector of meeting links, then iterate through those with purrr::map() to get all pdf links (a list of vectors, each list item being a vector of pdf links for one meeting). Text extraction is handled by another map() iteration, now inside a mutate() call. This example does not go any further; you can continue with keyword extraction & filtering from there.
The number of pdfs for all 2012 meetings is 1863, so just to play it safe (& be polite), the request rate is limited by wrapping the requesting functions with purrr::slowly(). By default it adds a 1-second delay between requests, meaning that the final iteration takes more than 30 minutes to complete. In the example below, the number of pdfs is limited to 5.
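If a different delay is preferred, the rate can be passed explicitly when wrapping a function; a minimal sketch using purrr's rate_delay() (the 2-second pause is just an illustrative value, not something the example below requires):

```r
library(purrr)
library(rvest)

# wrap read_html so consecutive calls are spaced at least 2 seconds apart
read_html_slow <- slowly(read_html, rate = rate_delay(pause = 2))
```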
This approach also assumes that all extracted text fits into available memory.
```r
library(rvest)
library(stringr)
library(purrr)
library(dplyr)
library(tidyr)
library(pdftools)
#> Using poppler version 23.08.0

# extract scheme + host
url_root <- \(url_) str_extract(url_, "^https?://.*?(?=(/|$))")

# extract all links that match the pattern,
# use slowly(read_html)() to limit request rate (by default 1 req/s)
get_links <- function(url_, pattern){
  slowly(read_html)(url_) |>
    html_elements("a") |>
    html_attr("href") |>
    str_subset(pattern)
}

url_ <- "https://councildocs.dsm.city/resolutions/"
(root_ <- url_root(url_))
#> [1] "https://councildocs.dsm.city"

# get vector of hrefs for 2012 meetings
get_links(url_, "/2012") |>
  # chr [1:30] "/resolutions/20120109/" "/resolutions/20120123/" ...
  # get pdf links for each meeting
  map(\(href) get_links(str_c(root_, href), "\\.pdf$")) |>
  # List of 30
  #  $ : chr [1:64] "/resolutions/20120109/10.pdf" "/resolutions/20120109/11.pdf" ...
  #  $ : chr [1:71] "/resolutions/20120123/10.pdf" "/resolutions/20120123/10I.pdf" ...
  #  ...
  # combine list of vectors to a vector, create tibble
  list_c() |>
  tibble(pdf_href = _) |>
  # extract meeting and doc columns from pdf_href column
  separate_wider_delim(pdf_href, "/", names = c(NA, NA, "meeting", "doc"), cols_remove = FALSE) |>
  #   meeting  doc    pdf_href
  #   <chr>    <chr>  <chr>
  # 1 20120109 10.pdf /resolutions/20120109/10.pdf
  # 2 20120109 11.pdf /resolutions/20120109/11.pdf
  # ...
  # for testing, limit the number of pdf documents to 5
  head(5) |>
  # build complete url, (slowly) fetch & read each pdf, store text in a list column
  # (pdf_text returns a vector of strings, each item corresponds to a page in the pdf)
  mutate(
    pdf_href = str_c(root_, pdf_href),
    pdf_text = map(pdf_href, \(url_) slowly(pdf_text)(url_))) |>
  # unnest pdf_text column (each pdf page to a separate row), add a column with the page number
  unnest_longer(pdf_text, indices_to = "page")
```
The resulting frame with text from the first 5 pdf documents, 25 pages in total:
```
#> # A tibble: 25 × 5
#>    meeting  doc    pdf_href                                       pdf_text  page
#>    <chr>    <chr>  <chr>                                          <chr>    <int>
#>  1 20120109 10.pdf https://councildocs.dsm.city/resolutions/2012… "* Roll…     1
#>  2 20120109 10.pdf https://councildocs.dsm.city/resolutions/2012… " * R…       2
#>  3 20120109 11.pdf https://councildocs.dsm.city/resolutions/2012… "* Roll…     1
#>  4 20120109 11.pdf https://councildocs.dsm.city/resolutions/2012… "* Roll…     2
#>  5 20120109 11.pdf https://councildocs.dsm.city/resolutions/2012… "* Roll…     3
#>  6 20120109 12.pdf https://councildocs.dsm.city/resolutions/2012… "* Roll…     1
#>  7 20120109 12.pdf https://councildocs.dsm.city/resolutions/2012… "1/4/12…     2
#>  8 20120109 12.pdf https://councildocs.dsm.city/resolutions/2012… "1/4/12…     3
#>  9 20120109 12.pdf https://councildocs.dsm.city/resolutions/2012… "1/4/12…     4
#> 10 20120109 13.pdf https://councildocs.dsm.city/resolutions/2012… "* Roll…     1
#> # ℹ 15 more rows
```
This is one option to go for "one fell swoop", and in an ideal world it should work just fine. Currently there is no error checking or exception handling (safely() and possibly() from purrr would be excellent for this when used in pipelines), meaning that if something unexpected happens during retrieval or parsing of a single file (e.g. the last one), you have to start all over.
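As a sketch of that idea, possibly() could wrap the pdf reader so that one bad file yields an NA instead of aborting the whole run; the otherwise value here is just an assumption for illustration:

```r
library(purrr)
library(pdftools)

# return NA instead of throwing an error when a single pdf fails to download or parse
pdf_text_safe <- possibly(pdf_text, otherwise = NA_character_)

# it could then stand in for pdf_text in the mutate() call above, e.g.:
# pdf_text = map(pdf_href, \(url_) slowly(pdf_text_safe)(url_))
```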
For tuning and debugging alone, I personally would first download and keep all the pdf documents locally, so I would not need to transfer the complete set of files (700+ MB in total) for every minor tweak & change.
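If you go that route, here is a minimal caching sketch with base R's download.file(); the links vector and the local pdfs/ directory are assumptions of this example, not part of the pipeline above:

```r
# `links` is assumed to be a vector of full pdf urls, e.g. str_c(root_, pdf_href) from above
dir.create("pdfs", showWarnings = FALSE)
purrr::walk(links, \(url_) {
  destfile <- file.path("pdfs", basename(url_))
  # only fetch files that are not already cached locally
  if (!file.exists(destfile)) {
    download.file(url_, destfile, mode = "wb", quiet = TRUE)
    Sys.sleep(1)  # stay polite, roughly matching the 1 req/s rate used above
  }
})
```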