Reputation: 23
I'm working on creating an automated process to pull tables from a yearly PDF report. Ideally, I'd be able to take each year's report, pull the data from the table within it, combine all years into a large data frame, and then analyze it. Here is what I have so far (just focusing on one year of the report):
library(pdftools)
library(data.table)
library(dplyr)
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/State%20Expenditure%20Report%20(Fiscal%202014-2016)%20-%20S.pdf", "nasbo14_16.pdf", mode = "wb")
txt14_16 <- pdf_text("nasbo14_16.pdf")
## convert txt14_16 to data frame for analyzing
data <- toString(txt14_16[56])
data <- read.table(text = data, sep = "\n", as.is = TRUE)
data <- data[-c(1, 2, 3, 4, 5, 6, 7, 14, 20, 26, 34, 47, 52, 58, 65, 66, 67), ]
data <- gsub("[,]", "", data)
data <- gsub("[$]", "", data)
data <- gsub("\\s+", ",", gsub("^\\s+|\\s+$", "",data))
My problem is converting this raw table text into a data frame with one row per state and the corresponding values in each column. I'm sure the solution is simple, but I'm just a little bit new to R! Any help?
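For context, the eventual goal of stacking every year into one data frame might look something like this sketch (the stub data frames below just stand in for each year's parsed table; the state names and totals are made up for illustration):

```r
# Stub data frames standing in for each year's parsed table
y2014 <- data.frame(state = c("Maine", "Vermont"), total = c(2780, 1411))
y2015 <- data.frame(state = c("Maine", "Vermont"), total = c(2533, 1551))
years <- list(`2014` = y2014, `2015` = y2015)

# Tag each table with its year, then stack them into one data frame
combined <- do.call(rbind, Map(function(df, yr) transform(df, year = yr),
                               years, names(years)))
```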
EDIT: All of these solutions have been terrific and have worked perfectly. However, when I try a report from another year, I get an error:
Error: ' 0' does not exist in current working directory ('C:/Users/joshua_hanson/Documents').
After trying this code for the next report:
download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/2010%20State%20Expenditure%20Report.pdf", "nasbo09_11.pdf", mode = "wb")
txt09_11 <- pdf_text("nasbo09_11.pdf")
df <- txt09_11[54] %>%
  read_lines() %>%                          # separate lines
  grep('^\\s{2}\\w', ., value = TRUE) %>%   # select lines with states, which start with space, space, letter
  paste(collapse = '\n') %>%                # recombine
  read_fwf(fwf_empty(.)) %>%                # read as fixed-width file
  mutate_at(-1, parse_number) %>%           # make numbers numbers
  mutate(X1 = sub('*', '', X1, fixed = TRUE))   # get rid of asterisks in state names
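One possible cause (an assumption on my part, not verified against that PDF): readr functions treat a string that contains no newline as a file path, so if the grep() step matches only a single short line on that page (for instance because the table sits on a different page in the older report), read_fwf() tries to open the leftover text as a file and fails with exactly this "does not exist in current working directory" message. Base strsplit() always treats its input as literal text, so it can stand in for the line-splitting step while debugging; "page" below is just a stand-in for txt09_11[54]:

```r
# strsplit() never interprets its input as a path, unlike readr's
# file-or-literal heuristic; "page" stands in for txt09_11[54]
page  <- "  Maine    746  1767\n  Vermont   282   797"
lines <- strsplit(page, "\n")[[1]]
```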
Upvotes: 1
Views: 7922
Reputation: 43334
readr::read_fwf has a fwf_empty utility that will guess column widths for you, which makes the job a lot simpler:
library(tidyverse)
df <- txt14_16[56] %>%
  read_lines() %>%                          # separate lines
  grep('^\\s{2}\\w', ., value = TRUE) %>%   # select lines with states, which start with space, space, letter
  paste(collapse = '\n') %>%                # recombine
  read_fwf(fwf_empty(.)) %>%                # read as fixed-width file
  mutate_at(-1, parse_number) %>%           # make numbers numbers
  mutate(X1 = sub('*', '', X1, fixed = TRUE))   # get rid of asterisks in state names
df
#> # A tibble: 50 × 13
#> X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612
#> 2 Maine 746 1767 267 2780 753 1510 270 2533 776
#> 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411
#> 4 New Hampshire 491 660 175 1326 515 936 166 1617 523
#> 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953
#> 6 Vermont 282 797 332 1411 302 923 326 1551 337
#> 7 Delaware 662 1001 0 1663 668 1193 14 1875 689
#> 8 Maryland 2893 4807 860 8560 2896 5686 1061 9643 2812
#> 9 New Jersey 3961 6920 1043 11924 3831 8899 1053 13783 3955
#> 10 New York 10981 24237 4754 39972 11161 29393 5114 45668 11552
#> # ... with 40 more rows, and 3 more variables: X11 <dbl>, X12 <dbl>,
#> # X13 <dbl>
Obviously column names still need to be added, but the data is fairly usable at this point.
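Adding the names could look like this; the names themselves are invented here (the real ones come from the table's header rows in the PDF), and a stub with the same shape stands in for df so the snippet is self-contained:

```r
# Stub with the same shape as df: 1 state column + 12 numeric columns
df <- as.data.frame(matrix(0, nrow = 2, ncol = 13))
df[[1]] <- c("Maine", "Vermont")

# 3 fiscal years x 4 spending categories, plus the state column;
# the category labels are guesses, not taken from the report
names(df) <- c("state",
               paste(rep(c("fy2014", "fy2015", "fy2016"), each = 4),
                     c("general", "federal", "other", "total"), sep = "_"))
```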
Upvotes: 6
Reputation: 160437
Your gsub calls are a bit overly aggressive. You are doing fine through your data[-c(1, ...)] line, so I'll pick up from there, replacing all of your calls to gsub:
# sloppy fixed-width parsing
dat2 <- read.fwf(textConnection(data), c(35,15,20,20,12,10,15,10,10,10,10,15,99))
# clean up extra whitespace
dat3 <- as.data.frame(lapply(dat2, trimws), stringsAsFactors = FALSE)
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* $3,779 $2,992 $0 $6,771 $3,496 $3,483 $0 $6,979 $3,612 $3,604 $0 $7,216
# 2 Maine* 746 1,767 267 2,780 753 1,510 270 2,533 776 1,605 274 2,655
# 3 Massachusetts 6,359 5,542 143 12,044 6,953 6,771 174 13,898 7,411 7,463 292 15,166
# 4 New Hampshire 491 660 175 1,326 515 936 166 1,617 523 1,197 238 1,958
# 5 Rhode Island 998 1,190 31 2,219 998 1,435 24 2,457 953 1,527 22 2,502
# 6 Vermont* 282 797 332 1,411 302 923 326 1,551 337 948 338 1,623
Caution: the widths I used (35,15,20,...) were hastily-derived, and though I think they work, admittedly I did not go row-by-row to verify that I did not chop something. Please verify!
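One mechanical way to do that verification, instead of checking row by row: re-read with the widths and assert that no field came out NA or empty. A toy three-column version of the check (made-up rows, not the report's real layout):

```r
# Toy version of the width sanity check
txt <- c("Maine          746  1767",
         "Massachusetts 6359  5542")
dat <- read.fwf(textConnection(txt), widths = c(14, 5, 6))
dat <- as.data.frame(lapply(dat, trimws), stringsAsFactors = FALSE)

# If a width chops a field in two, the leftover piece shows up
# as NA or an empty string in some column
stopifnot(!anyNA(dat), all(nzchar(as.matrix(dat))))
```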
Lastly, from here you likely want to remove the $ and , characters and convert the columns to integers; that's fairly straightforward:
dat3[-1] <- lapply(dat3[-1], function(a) as.integer(gsub("[^0-9]", "", a)))
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13
# 1 Connecticut* 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216
# 2 Maine* 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502
# 6 Vermont* 282 797 332 1411 302 923 326 1551 337 948 338 1623
I'm guessing the asterisks in the state names are meaningful. They can be captured easily with grepl and then removed:
dat3$ast <- grepl("\\*", dat3$V1)
dat3[[1]] <- gsub("\\*", "", dat3[[1]])
head(dat3)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 ast
# 1 Connecticut 3779 2992 0 6771 3496 3483 0 6979 3612 3604 0 7216 TRUE
# 2 Maine 746 1767 267 2780 753 1510 270 2533 776 1605 274 2655 TRUE
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166 FALSE
# 4 New Hampshire 491 660 175 1326 515 936 166 1617 523 1197 238 1958 FALSE
# 5 Rhode Island 998 1190 31 2219 998 1435 24 2457 953 1527 22 2502 FALSE
# 6 Vermont 282 797 332 1411 302 923 326 1551 337 948 338 1623 TRUE
Upvotes: 0