FoxyFetch333

Reputation: 23

Converting PDF table to data.frame in R

I'm working on creating an automated process to pull tables from a yearly PDF report. Ideally, I'd be able to take each year's report, pull the data from the table within it, combine all years into a large data frame, and then analyze it. Here is what I have so far (just focusing on one year of the report):

library(pdftools)
library(data.table)
library(dplyr)

download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/State%20Expenditure%20Report%20(Fiscal%202014-2016)%20-%20S.pdf", "nasbo14_16.pdf", mode = "wb")

txt14_16 <- pdf_text("nasbo14_16.pdf")


 ## convert txt14_16 to data frame for analyzing
data <- toString(txt14_16[56])
data <- read.table(text = data, sep = "\n", as.is = TRUE)
data <- data[-c(1, 2, 3, 4, 5, 6, 7, 14, 20, 26, 34, 47, 52, 58, 65, 66, 67), ]
data <- gsub("[,]", "", data)
data <- gsub("[$]", "", data)
data <- gsub("\\s+", ",", gsub("^\\s+|\\s+$", "",data))

My problem is converting these raw table data into a dataframe that has each state for every row and their respective values for every column. I'm sure the solution is simple, but I'm just a little bit new to R! Any help?

EDIT: All of these solutions have been terrific and have worked perfectly. However, when I try a report from another year, I get some errors:

Error: '  0' does not exist in current working directory ('C:/Users/joshua_hanson/Documents').

After trying this code for the next report:

## convert txt09_11 to data frame for analyzing

download.file("https://higherlogicdownload.s3.amazonaws.com/NASBO/9d2d2db1-c943-4f1b-b750-0fca152d64c2/UploadedImages/SER%20Archive/2010%20State%20Expenditure%20Report.pdf", "nasbo09_11.pdf", mode = "wb")

txt09_11 <- pdf_text("nasbo09_11.pdf")

df <- txt09_11[54] %>%
    read_lines() %>%    # separate lines
    grep('^\\s{2}\\w', ., value = TRUE) %>%    # select lines with states, which start with space, space, letter
    paste(collapse = '\n') %>%    # recombine
    read_fwf(fwf_empty(.)) %>%    # read as fixed-width file
    mutate_at(-1, parse_number) %>%    # make numbers numbers
    mutate(X1 = sub('*', '', X1, fixed = TRUE))    # get rid of asterisks in state names

Upvotes: 1

Views: 7922

Answers (2)

alistaire

Reputation: 43334

readr::read_fwf has a fwf_empty utility that will guess column widths for you, which makes the job a lot simpler:

library(tidyverse)

df <- txt14_16[56] %>% 
    read_lines() %>%    # separate lines
    grep('^\\s{2}\\w', ., value = TRUE) %>%    # select lines with states, which start with space, space, letter
    paste(collapse = '\n') %>%    # recombine
    read_fwf(fwf_empty(.)) %>%    # read as fixed-width file
    mutate_at(-1, parse_number) %>%    # make numbers numbers
    mutate(X1 = sub('*', '', X1, fixed = TRUE))    # get rid of asterisks in state names

df
#> # A tibble: 50 × 13
#>               X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
#>            <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    Connecticut  3779  2992     0  6771  3496  3483     0  6979  3612
#> 2          Maine   746  1767   267  2780   753  1510   270  2533   776
#> 3  Massachusetts  6359  5542   143 12044  6953  6771   174 13898  7411
#> 4  New Hampshire   491   660   175  1326   515   936   166  1617   523
#> 5   Rhode Island   998  1190    31  2219   998  1435    24  2457   953
#> 6        Vermont   282   797   332  1411   302   923   326  1551   337
#> 7       Delaware   662  1001     0  1663   668  1193    14  1875   689
#> 8       Maryland  2893  4807   860  8560  2896  5686  1061  9643  2812
#> 9     New Jersey  3961  6920  1043 11924  3831  8899  1053 13783  3955
#> 10      New York 10981 24237  4754 39972 11161 29393  5114 45668 11552
#> # ... with 40 more rows, and 3 more variables: X11 <dbl>, X12 <dbl>,
#> #   X13 <dbl>

Obviously column names still need to be added, but the data is fairly usable at this point.
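For instance, names can be attached with `names<-`; the labels below are placeholders I made up from the table's apparent structure (three fiscal years, four fund columns each), so check them against the actual header row in the PDF:

```r
# hypothetical column names -- verify against the table header in the PDF
fund_cols <- c('general', 'federal', 'other', 'total')
names(df) <- c('state', paste0('fy', rep(2014:2016, each = 4), '_', fund_cols))
```

`rep(2014:2016, each = 4)` produces 12 year labels, and `fund_cols` recycles across them, giving `fy2014_general` through `fy2016_total` for the 12 numeric columns.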

Upvotes: 6

r2evans

Reputation: 160437

Your gsub calls are a bit too aggressive. You're doing fine through your data[-c(1,...)] step, so I'll pick up from there, replacing all of your calls to gsub:

# sloppy fixed-width parsing
dat2 <- read.fwf(textConnection(data), c(35,15,20,20,12,10,15,10,10,10,10,15,99))
# clean up extra whitespace
dat3 <- as.data.frame(lapply(dat2, trimws), stringsAsFactors = FALSE)
head(dat3)
#              V1     V2     V3  V4     V5     V6     V7  V8     V9    V10    V11 V12    V13
# 1  Connecticut* $3,779 $2,992  $0 $6,771 $3,496 $3,483  $0 $6,979 $3,612 $3,604  $0 $7,216
# 2        Maine*    746  1,767 267  2,780    753  1,510 270  2,533    776  1,605 274  2,655
# 3 Massachusetts  6,359  5,542 143 12,044  6,953  6,771 174 13,898  7,411  7,463 292 15,166
# 4 New Hampshire    491    660 175  1,326    515    936 166  1,617    523  1,197 238  1,958
# 5  Rhode Island    998  1,190  31  2,219    998  1,435  24  2,457    953  1,527  22  2,502
# 6      Vermont*    282    797 332  1,411    302    923 326  1,551    337    948 338  1,623

Caution: the widths I used (35,15,20,...) were hastily-derived, and though I think they work, admittedly I did not go row-by-row to verify that I did not chop something. Please verify!

Lastly, from here you likely want to remove the $ and , characters and convert to integer; that's fairly straightforward:

dat3[-1] <- lapply(dat3[-1], function(a) as.integer(gsub("[^0-9]", "", a)))
head(dat3)
#              V1   V2   V3  V4    V5   V6   V7  V8    V9  V10  V11 V12   V13
# 1  Connecticut* 3779 2992   0  6771 3496 3483   0  6979 3612 3604   0  7216
# 2        Maine*  746 1767 267  2780  753 1510 270  2533  776 1605 274  2655
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166
# 4 New Hampshire  491  660 175  1326  515  936 166  1617  523 1197 238  1958
# 5  Rhode Island  998 1190  31  2219  998 1435  24  2457  953 1527  22  2502
# 6      Vermont*  282  797 332  1411  302  923 326  1551  337  948 338  1623

I'm guessing the asterisks in the state names are meaningful. They can be captured easily with grepl and then removed:

dat3$ast <- grepl("\\*", dat3$V1)
dat3[[1]] <- gsub("\\*", "", dat3[[1]])
head(dat3)
#              V1   V2   V3  V4    V5   V6   V7  V8    V9  V10  V11 V12   V13   ast
# 1   Connecticut 3779 2992   0  6771 3496 3483   0  6979 3612 3604   0  7216  TRUE
# 2         Maine  746 1767 267  2780  753 1510 270  2533  776 1605 274  2655  TRUE
# 3 Massachusetts 6359 5542 143 12044 6953 6771 174 13898 7411 7463 292 15166 FALSE
# 4 New Hampshire  491  660 175  1326  515  936 166  1617  523 1197 238  1958 FALSE
# 5  Rhode Island  998 1190  31  2219  998 1435  24  2457  953 1527  22  2502 FALSE
# 6       Vermont  282  797 332  1411  302  923 326  1551  337  948 338  1623  TRUE
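Since the larger goal is stacking several years of reports, the steps above fold naturally into one helper. This is only a sketch: the page number, the fixed widths, and the two-space state-row pattern all vary between report editions, so treat every argument as an assumption to verify per file:

```r
library(pdftools)

# sketch of a per-page extractor; page/widths must be checked for each report
read_nasbo_page <- function(pdf_file, page, widths) {
  lines <- strsplit(pdf_text(pdf_file)[page], "\n")[[1]]
  # keep only state rows, which start with two spaces then a letter
  states <- grep("^\\s{2}\\w", lines, value = TRUE)
  dat <- read.fwf(textConnection(states), widths, as.is = TRUE)
  dat[] <- lapply(dat, trimws)
  dat[-1] <- lapply(dat[-1], function(a) as.integer(gsub("[^0-9]", "", a)))
  dat$ast <- grepl("\\*", dat[[1]])          # flag asterisked states
  dat[[1]] <- gsub("\\*", "", dat[[1]])      # then strip the asterisks
  dat
}

# illustrative call (page and widths from this answer, not verified elsewhere):
# df14_16 <- read_nasbo_page("nasbo14_16.pdf", 56,
#                            c(35,15,20,20,12,10,15,10,10,10,10,15,99))
```

With a year column added after each call, the results can then be stacked with rbind (or dplyr::bind_rows) into the single data frame the question asks for.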

Upvotes: 0
