Reputation: 545
I figured out how to scrape this PDF, but I have a lot of these files that I need to go through. My intention was to set this as a function, import data from all of the pdfs (one pdf per month for several years) and then do an rbind() to make one data table that I can then write as a csv.
This works.
library(tidyverse)
library(tabulizer)
#import the data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")
#create data frame
cleanNvsen <- do.call(rbind, jan16s_raw)
cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
#rename all of the columns
names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"
#check to see if it worked
head(cleanNvsen2)
But this results in a 1 x 1 data frame:
library(tidyverse)
library(tabulizer)
#load data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")
#create function to create data frame and then rename
clean <- function(x) {
cleanNvsen <- do.call(rbind, x)
cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"
}
x2 <- clean(jan16s_raw)
head(x2)
I'd really like to get this to work so that I can just feed R the URLs and then run this clean function I've created. I have dozens of files to go through.
Upvotes: 0
Views: 70
Reputation: 389215
You can write the clean function to take the URL, extract the data, and rename the columns. All the columns can be renamed at once; there is no need to rename them individually. Note also that your original function never returns cleanNvsen2: an R function returns the value of its last evaluated expression, which in your version is the final names() assignment, so you get back a single string instead of the data frame.
clean <- function(url) {
jan16s_raw <- extract_tables(url)
#create data frame
cleanNvsen <- do.call(rbind, jan16s_raw)
cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
#rename all of the columns
names(cleanNvsen2) <- c("District", "Democrat", "Independent American",
"Libertarian","Nonpartisan","Other","Republican","Total")
return(cleanNvsen2)
}
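As a quick sanity check (mirroring the head() call from the question), you can run clean on the January 2016 file first:
#should reproduce the data frame built manually above
jan16 <- clean('https://www.nvsos.gov/sos/home/showdocument?id=4062')
head(jan16)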
Create a vector of all the URLs from which you want to extract the data.
list_of_urls <- c('https://www.nvsos.gov/sos/home/showdocument?id=4062',
'https://www.nvsos.gov/sos/home/showdocument?id=4064')
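With dozens of files, you don't have to paste each full URL by hand. Assuming the monthly PDFs differ only in their document id, you could build the vector with paste0; the ids vector below is hypothetical, fill in one id per month:
#hypothetical document ids, one per monthly PDF
ids <- c(4062, 4064)
list_of_urls <- paste0('https://www.nvsos.gov/sos/home/showdocument?id=', ids)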
Then call the clean
function for each URL and combine the data.
all_data <- purrr::map_df(list_of_urls, clean)
#OR
#all_data <- do.call(rbind, lapply(list_of_urls, clean))
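Finally, since the end goal is a single CSV, you can write the combined table to disk with readr's write_csv (loaded with the tidyverse); the filename here is just a placeholder:
#write the combined table out (filename is a placeholder)
write_csv(all_data, "nv_registration_all_months.csv")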
Upvotes: 1