Reputation: 545
I figured out how to scrape this PDF, but I have a lot of these files that I need to go through. My intention was to set this as a function, import data from all of the pdfs (one pdf per month for several years) and then do an rbind() to make one data table that I can then write as a csv.
This works.
library(tidyverse)
library(tabulizer)
#import the data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")
#create data frame
cleanNvsen <- do.call(rbind, jan16s_raw)
cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
#rename all of the columns
names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"
#check to see if it worked
head(cleanNvsen2)
But this results in a 1 x 1 data frame:
library(tidyverse)
library(tabulizer)
#load data
jan16s_raw <- extract_tables("https://www.nvsos.gov/sos/home/showdocument?id=4062")
#create function to create data frame and then rename
clean <- function(x) {
cleanNvsen <- do.call(rbind, x)
cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
names(cleanNvsen2)[1] <- "District"
names(cleanNvsen2)[2] <- "Democrat"
names(cleanNvsen2)[3] <- "Independent American"
names(cleanNvsen2)[4] <- "Libertarian"
names(cleanNvsen2)[5] <- "Nonpartisan"
names(cleanNvsen2)[6] <- "Other"
names(cleanNvsen2)[7] <- "Republican"
names(cleanNvsen2)[8] <- "Total"
}
x2 <- clean(jan16s_raw)
head(x2)
I'd really like to get this to work so that I can just feed R the URLs and then run this clean function I've created. I have dozens of files to go through.
Upvotes: 0
Views: 70
Reputation: 389215
You can write the clean function to take the URL, extract the data, and rename the columns. All the columns can be renamed at once; there is no need to rename them individually. Note also that your original function never returns cleanNvsen2: an R function returns the value of its last evaluated expression, which in your version is the final names() assignment, so you get back a single string instead of the data frame.
clean <- function(url) {
jan16s_raw <- extract_tables(url)
#create data frame
cleanNvsen <- do.call(rbind, jan16s_raw)
cleanNvsen2 <- as.data.frame(cleanNvsen[3:nrow(cleanNvsen),])
#rename all of the columns
names(cleanNvsen2) <- c("District", "Democrat", "Independent American",
"Libertarian","Nonpartisan","Other","Republican","Total")
return(cleanNvsen2)
}
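As a quick sanity check (mirroring the head() call from the question), you can run clean on the January 2016 file first:
#should reproduce the data frame built manually above
jan16 <- clean('https://www.nvsos.gov/sos/home/showdocument?id=4062')
head(jan16)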
Create a vector of all the URLs from which you want to extract the data.
list_of_urls <- c('https://www.nvsos.gov/sos/home/showdocument?id=4062',
'https://www.nvsos.gov/sos/home/showdocument?id=4064')
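With dozens of files, you don't have to paste each full URL by hand. Assuming the monthly PDFs differ only in their document id, you could build the vector with paste0; the ids vector below is hypothetical, fill in one id per month:
#hypothetical document ids, one per monthly PDF
ids <- c(4062, 4064)
list_of_urls <- paste0('https://www.nvsos.gov/sos/home/showdocument?id=', ids)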
Then call the clean
function for each URL and combine the data.
all_data <- purrr::map_df(list_of_urls, clean)
#OR
#all_data <- do.call(rbind, lapply(list_of_urls, clean))
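Finally, since the end goal is a single CSV, you can write the combined table to disk with readr's write_csv (loaded with the tidyverse); the filename here is just a placeholder:
#write the combined table out (filename is a placeholder)
write_csv(all_data, "nv_registration_all_months.csv")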
Upvotes: 1