Reputation: 69
I'm trying to import this PDF, https://www.mountwashington.org/uploads/forms/2018/01.pdf , into R and get it formatted as a data frame. Is there an efficient way to work around the weird headers and keep just the main column headers (not the bigger group headers like location and station)?
I was able to get what I wanted by converting the PDF to an Excel file with a converter website, manually editing the columns and rows in Excel, and then importing the result into R. That was very inefficient, and I would like to do the whole thing in R. I tried the tabulizer package, but it returned the data as character strings, completely unorganized.
This is what I'd like it to look like:
> a
DAY MAX MIN AVG NORM DEPART HEAT COOL TOTAL..EQUIV. SNOW...ICE AVG.WIND.SPEED..MPH. FASTEST.SPEED DIR
1 1 -14 -25 -19 6 -25 84 0 0.00 0.0 55.3 79 310 (NW)
2 2 -7 -23 -15 6 -21 80 0 0.01 0.7 53.8 84 280 (W)
3 3 7 -7 0 6 -6 65 0 T T 39.2 64 280 (W)
And this is what I was able to get with tabulizer:
[,1]
[1,] "WS FORM F-6"
[2,] ""
[3,] "PRELIMINARY LOCAL CLIMATOLOGICAL DATA"
[4,] ""
[5,] "LATITUDE LONGITUDE"
[6,] "44 DEGREES16 MINUTESNORTH 71 DEGREES 18 MINUTES WEST"
[7,] "TEMPERATURE (°F) PRECIPITATION (IN.)"
[8,] "DEGREE DAYS TOTAL SNOW & SNOW/ICE ON AVG"
[9,] "DAY MAX MIN AVG NORM DEPART HEAT COOL (EQUIV) ICE GROUND-7AM SPEED"
[10,] "1 -14 -25 -19 6 -25 84 0 0.00 0.0 23 55.3"
and then many more lines of unorganized data that seemed to be pulled randomly from the page.
Any help would be great, thanks!
Upvotes: 1
Views: 658
Reputation: 897
You can use tabulizer to extract the table. Use locate_areas to find the coordinates of the area to extract.
Take a look at this link
library(tabulizer)
# I used locate_areas("https://www.mountwashington.org/uploads/forms/2018/01.pdf")
# to find the area of the table to extract
mw_table <- extract_tables(
"https://www.mountwashington.org/uploads/forms/2018/01.pdf",
output = "data.frame",
area = list(c(103.49321, 15.79171, 402.56716, 586.74627)),
guess = FALSE
)
mw_table[[1]]
Then you just need to rename the columns of the data frame.
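The renaming step could look something like the sketch below. It is only an illustration: `mw_raw` is a hypothetical stand-in for `mw_table[[1]]`, built from the first four columns of the rows shown in the question, and `col_names` is a hypothetical vector taken from the desired output. The real extraction may return a different number of columns (the tabulizer output in the question includes a GROUND-7AM value, for example), so check `ncol()` and extend the names vector to match before assigning.

```r
# Stand-in for mw_table[[1]]: three rows copied from the question, using the
# placeholder names V1, V2, ... (an assumption about what a header-less
# extraction leaves behind).
mw_raw <- data.frame(
  V1 = c(1, 2, 3),
  V2 = c(-14, -7, 7),
  V3 = c(-25, -23, -7),
  V4 = c(-19, -15, 0)
)

# Hypothetical column names, taken from the desired output in the question.
col_names <- c("DAY", "MAX", "MIN", "AVG")

# Fail early if the number of names does not match the number of columns.
stopifnot(length(col_names) == ncol(mw_raw))

# Assign the new names in place.
names(mw_raw) <- col_names

mw_raw
```

With the real table, the same pattern applies: build one name per extracted column, confirm the count, then assign with `names()`.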
Upvotes: 1