Robin vd Maat

Reputation: 43

Download files from a dataframe with multiple elements and rows that contain URLs

I scraped a number of URLs from multiple websites and put them in a large list, which contains 145 elements (one for each website that was scraped). Every element has between 90 and 300 rows in a column called X[[i]]. What I want to do next is search for the word "agenda" in the URLs in the list and download the documents behind those URLs, but I'm having trouble doing this.

The code I have so far is:

## scrape urls
library(rvest)
library(purrr)
library(tibble)

url_base <- "https://amsterdam.raadsinformatie.nl/sitemap/meetings/201%d"
urls <- map_df(7:8, function(i) {
  page <- read_html(sprintf(url_base, i))
  data_frame(urls = html_nodes(page, "a") %>% html_attr("href"))
})
rcverg17_18 <- data.frame(urls[grep("raadscomm", urls$urls), ])

## clean data
rcverg17_18v2 <- sub(" .*", "", rcverg17_18$urls)

## scrape urls from websites
library(glue)

list <- map(rcverg17_18v2, function(url) {
  url <- glue("https://amsterdam.raadsinformatie.nl{url}")
  read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href")
})
list2 <- lapply(list, as.data.frame)

This gives a large list that looks like:

list2

list2 list[145]                      List of length 145
[[1]] list[239 x 1] (S3: data.frame) A data.frame with 239 rows and 1 column
[[2]] list[139 x 1] (S3: data.frame) A data.frame with 139 rows and 1 column
[[3]] list[185 x 1] (S3: data.frame) A data.frame with 185 rows and 1 column
[[4]] list[170 x 1] (S3: data.frame) A data.frame with 170 rows and 1 column
...

An element contains different kinds of information, for example:

list2[[1]] 

X[[i]]
1 #zoeken                                                                                            
2 #agenda_container                                                                                                                                                                 
3 #media_wrapper
4 ...

but also URLs with whitespace in them, such as:

104            https://amsterdam.raadsinformatie.nl/document/4851596/1/ID_17-01-11_Termijnagenda_Verkeer_en_Vervoer

What I want is to search for URLs that contain 'agenda' in their name and download the files behind those URLs. I know that I have to use the download.file() function to download the files, but I don't know exactly how. I also don't know how to search for the URLs in this kind of structure (a list of data frames). Can anyone help me finish the code?

Note that the whitespace in the cells still has to be removed in order to download the files.
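
For a single, already-cleaned URL I assume the call would look something like the sketch below (the gsub() call for stripping whitespace and the choice of file name are just my guesses):

one_url <- "https://amsterdam.raadsinformatie.nl/document/4851596/1/ID_17-01-11_Termijnagenda_Verkeer_en_Vervoer"
one_url <- gsub("\\s+", "", one_url)                               # remove any whitespace
download.file(one_url, destfile = basename(one_url), mode = "wb")  # save under the last path segment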

Upvotes: 3

Views: 497

Answers (1)

NM_

Reputation: 1999

We can achieve this with the following code:

# Run this code after you create list but before you create list2

library(data.table)   # provides the %like% operator used below

data.frame2 = function(x){
  data.frame(x, stringsAsFactors = FALSE)
}

# New code for list 2  
list2 <- lapply(list, data.frame2)

# Convert list to data frame
df = do.call(rbind.data.frame, list2)

# obtain a vector of URLs which contain the word "agenda"
url.vec = df[df$x %like% "agenda", ]

# Remove elements of the vector which are the string "#agenda_container" (these are not URLs)
url.vec = url.vec[url.vec != "#agenda_container"]

# Obtain all URLs which contain the string "document". These URLs allow us to fetch documents. The URLs which don't contain "document" are web pages and cannot be fetched.
url.vec = url.vec[url.vec %like% "document"]
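
# If you would rather not load data.table just for %like%, the three filtering
# steps above can also be written with base R's grepl(); this is only an
# equivalent sketch, not part of the original answer
url.vec = df$x[grepl("agenda", df$x, fixed = TRUE)]
url.vec = url.vec[url.vec != "#agenda_container"]
url.vec = url.vec[grepl("document", url.vec, fixed = TRUE)]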

# Set the working directory
# setwd("~/PATH WHERE YOU WOULD LIKE TO SAVE THE FILES")

# Download files in a loop
# we have to add the extension ".pdf"
# temp.name will name your files with the last part of the URL, after the last slash ("/")

for(i in url.vec){
  temp.name = tail(unlist(strsplit(i, "/")), 1)
  download.file(i, destfile = paste0(temp.name, ".pdf"), mode = "wb")
}
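
The question also mentions that some of the scraped URLs still contain whitespace; the loop above assumes clean URLs. A rough sketch of stripping it from the vector before running the loop:

url.vec = gsub("\\s+", "", url.vec)  # remove any whitespace left in the URLs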

Check your folder and all of your files should be downloaded. Here is the temporary folder that I downloaded your files into:

Folder after downloading all documents

Upvotes: 1
