Reputation: 23
I'm working on creating an automated process to pull tables from a yearly PDF report. Ideally, I'd be able to take each year's report, pull the data from the table within it, combine all years into a large data frame, and then analyze it. Here is what I have so far (just focusing on one year of the report):
library(pdftools)
library(data.table)
library(dplyr)
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/State%20Expenditure%20Report%20(Fiscal%202014-2016)%20-%20S.pdf", "nasbo14_16.pdf", mode = "wb")
txt14_16 <- pdf_text("nasbo14_16.pdf")
## convert txt14_16 to data frame for analyzing
data <- toString(txt14_16[56])
data <- read.table(text = data, sep = "\n", as.is = TRUE)
data <- data[-c(1, 2, 3, 4, 5, 6, 7, 14, 20, 26, 34, 47, 52, 58, 65, 66, 67), ]
data <- gsub("[,]", "", data)
data <- gsub("[$]", "", data)
data <- gsub("\\s+", ",", gsub("^\\s+|\\s+$", "",data))
My problem is converting this raw table text into a data frame with one row per state and the corresponding values in each column. I'm sure the solution is simple, but I'm just a little bit new to R! Any help?
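For context, the eventual goal of stacking every year into one data frame might look something like this sketch (the stub data frames below just stand in for each year's parsed table; the state names and totals are made up for illustration):

```r
# Stub data frames standing in for each year's parsed table
y2014 <- data.frame(state = c("Maine", "Vermont"), total = c(2780, 1411))
y2015 <- data.frame(state = c("Maine", "Vermont"), total = c(2533, 1551))
years <- list(`2014` = y2014, `2015` = y2015)

# Tag each table with its year, then stack them into one data frame
combined <- do.call(rbind, Map(function(df, yr) transform(df, year = yr),
                               years, names(years)))
```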
EDIT: All of these solutions have been terrific and have worked perfectly. However, when I try a report from another year, I get an error:
Error: ' 0' does not exist in current working directory ('C:/Users/joshua_hanson/Documents').
After trying this code for the next report:
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/2010%20State%20Expenditure%20Report.pdf", "nasbo09_11.pdf", mode = "wb")
txt09_11 <- pdf_text("nasbo09_11.pdf")
df <- txt09_11[54] %>%
  read_lines() %>%                          # separate lines
  grep('^\\s{2}\\w', ., value = TRUE) %>%   # select lines with states, which start with space, space, letter
  paste(collapse = '\n') %>%                # recombine
  read_fwf(fwf_empty(.)) %>%                # read as fixed-width file
  mutate_at(-1, parse_number) %>%           # make numbers numbers
  mutate(X1 = sub('*', '', X1, fixed = TRUE))   # get rid of asterisks in state names
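One possible cause (an assumption on my part, not verified against that PDF): readr functions treat a string that contains no newline as a file path, so if the grep() step matches only a single short line on that page (for instance because the table sits on a different page in the older report), read_fwf() tries to open the leftover text as a file and fails with exactly this "does not exist in current working directory" message. Base strsplit() always treats its input as literal text, so it can stand in for the line-splitting step while debugging; "page" below is just a stand-in for txt09_11[54]:

```r
# strsplit() never interprets its input as a path, unlike readr's
# file-or-literal heuristic; "page" stands in for txt09_11[54]
page  <- "  Maine    746  1767\n  Vermont   282   797"
lines <- strsplit(page, "\n")[[1]]
```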
Upvotes: 1
Views: 7922
Reputation: 43334
readr::read_fwf has a fwf_empty utility that will guess column widths for you, which makes the job a lot simpler:
library(tidyverse)
df <- txt14_16[56] %>%
  read_lines() %>%                          # separate lines
  grep('^\\s{2}\\w', ., value = TRUE) %>%   # select lines with states, which start with space, space, letter
  paste(collapse = '\n') %>%                # recombine
  read_fwf(fwf_empty(.)) %>%                # read as fixed-width file
  mutate_at(-1, parse_number) %>%           # make numbers numbers
  mutate(X1 = sub('*', '', X1, fixed = TRUE))   # get rid of asterisks in state names
df
#> # A tibble: 50 × 13
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612
#> 2 Maine 746 1767 267 2780 753 1510 270 2533 776
#> 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411
#> 4 New Hampshire 491 660 175 1326 515 936 166 1617 523
#> 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953
#> 6 Vermont 282 797 332 1411 302 923 326 1551 337
#> 7 Delaware 662 1001 0 1663 668 1193 14 1875 689
#> 8 Maryland 2893 4807 860 8560 2896 5686 1061 9643 2812
#> 9 New Jersey 3961 6920 1043 11924 3831 8899 1053 13783 3955
#> 10 New York 10981 24237 4754 39972 11161 29393 5114 45668 11552
#> # ... with 40 more rows, and 3 more variables: X11 <dbl>, X12 <dbl>,
#> # X13 <dbl>
Obviously column names still need to be added, but the data is fairly usable at this point.
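Adding the names could look like this; the names themselves are invented here (the real ones come from the table's header rows in the PDF), and a stub with the same shape stands in for df so the snippet is self-contained:

```r
# Stub with the same shape as df: 1 state column + 12 numeric columns
df <- as.data.frame(matrix(0, nrow = 2, ncol = 13))
df[[1]] <- c("Maine", "Vermont")

# 3 fiscal years x 4 spending categories, plus the state column;
# the category labels are guesses, not taken from the report
names(df) <- c("state",
               paste(rep(c("fy2014", "fy2015", "fy2016"), each = 4),
                     c("general", "federal", "other", "total"), sep = "_"))
```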
Upvotes: 6
Reputation: 160437
Your gsub calls are a bit overly aggressive. You are doing fine through your data[-c(1, ...)] line, so I'll pick up from there, replacing all of your calls to gsub:
# sloppy fixed-width parsing
dat2 <- read.fwf(textConnection(data), c(35,15,20,20,12,10,15,10,10,10,10,15,99))
# clean up extra whitespace
dat3 <- as.data.frame(lapply(dat2, trimws), stringsAsFactors = FALSE)
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* $3,779 $2,992 $0 $6,771 $3,496 $3,483 $0 $6,979 $3,612 $3,604 $0 $7,216
# 2 Maine* 746 1,767 267 2,780 753 1,510 270 2,533 776 1,605 274 2,655
# 3 Massachusetts 6,359 5,542 143 12,044 6,953 6,771 174 13,898 7,411 7,463 292 15,166
# 4 New Hampshire 491 660 175 1,326 515 936 166 1,617 523 1,197 238 1,958
# 5 Rhode Island 998 1,190 31 2,219 998 1,435 24 2,457 953 1,527 22 2,502
# 6 Vermont* 282 797 332 1,411 302 923 326 1,551 337 948 338 1,623
Caution: the widths I used (35,15,20,...) were hastily-derived, and though I think they work, admittedly I did not go row-by-row to verify that I did not chop something. Please verify!
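One mechanical way to do that verification, instead of checking row by row: re-read with the widths and assert that no field came out NA or empty. A toy three-column version of the check (made-up rows, not the report's real layout):

```r
# Toy version of the width sanity check
txt <- c("Maine          746  1767",
         "Massachusetts 6359  5542")
dat <- read.fwf(textConnection(txt), widths = c(14, 5, 6))
dat <- as.data.frame(lapply(dat, trimws), stringsAsFactors = FALSE)

# If a width chops a field in two, the leftover piece shows up
# as NA or an empty string in some column
stopifnot(!anyNA(dat), all(nzchar(as.matrix(dat))))
```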
Lastly, from here you likely want to remove the $ and , characters and convert the columns to integers; that's fairly straightforward:
dat3[-1] <- lapply(dat3[-1], function(a) as.integer(gsub("[^0-9]", "", a)))
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216
# 2 Maine* 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502
# 6 Vermont* 282 797 332 1411 302 923 326 1551 337 948 338 1623
I'm guessing the asterisks in the state names are meaningful. They can be captured easily with grepl and then removed:
dat3$ast <- grepl("\\*", dat3$V1)
dat3[[1]] <- gsub("\\*", "", dat3[[1]])
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 ast
# 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216 TRUE
# 2 Maine 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655 TRUE
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166 FALSE
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958 FALSE
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502 FALSE
# 6 Vermont 282 797 332 1411 302 923 326 1551 337 948 338 1623 TRUE
Upvotes: 0