Maxwell

Reputation: 23

R tm package: readPDF fails with "Error in strptime(d, fmt) : input string is too long"

I would like to text-mine the files on this website using the tm package. I am using the following code to download one of the files (abell.pdf) to my working directory and read its contents:

library("tm")
url <- "https://baltimore2006to2010acsprofiles.files.wordpress.com/2014/07/abell.pdf"
filename <- "abell.pdf"
download.file(url = url, destfile = filename, method = "curl")

doc <- readPDF(control = list(text = "-layout"))(elem = list(uri = filename),
                                                 language = "en", id = "id1")

But I receive the following error and warnings:

Error in strptime(d, fmt) : input string is too long
In addition: Warning messages:
1: In grepl(re, lines) : input string 1 is invalid in this locale
2: In grepl(re, lines) : input string 2 is invalid in this locale

The PDFs aren't particularly long (5 pages, 978 KB), and I have successfully used readPDF to read other PDF files on my Mac (OS X). The information I want most (the total population from the 2010 census) is on the first page of each PDF, so I tried cutting the PDF down to just its first page, but I get the same error.

I am new to the tm package, so I apologize if I am missing something obvious. Any help is greatly appreciated!

Upvotes: 2

Views: 347

Answers (1)

Danny

Reputation: 36

From what I've read, this error comes from the way readPDF tries to build the metadata for the file you're importing. You can change how that metadata is gathered through the "info" element of the control list. For example, I usually get around this error by modifying the command as follows (using your code):

doc <- readPDF(control = list(info = "-f", text = "-layout"))(elem = list(uri = filename),
                                                             language = "en", id = "id1")

The only change is the addition of info = "-f" to the control list. This doesn't really fix the underlying problem, but it bypasses the error. Cheers :)
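Since you mention wanting to mine several files from that site, here is a minimal sketch of how the same workaround might be applied to every PDF in a folder. The helper name read_pdf_dir is hypothetical (it is not part of tm), and the sketch assumes the PDFs have already been downloaded and that the info = "-f" workaround behaves the same for each of them, which I have not verified.

```r
# Hypothetical helper (not part of tm): read every PDF in a directory
# using the same control options as the workaround above.
read_pdf_dir <- function(dir = ".") {
  # Collect the paths of all files ending in .pdf (case-insensitive)
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)

  # Build the reader once; info = "-f" is the workaround, text = "-layout"
  # is the option from the original question
  reader <- tm::readPDF(control = list(info = "-f", text = "-layout"))

  # Read each file, using the file name as the document id
  docs <- lapply(files, function(f) {
    reader(elem = list(uri = f), language = "en", id = basename(f))
  })
  names(docs) <- basename(files)
  docs
}
```

You would call it as docs <- read_pdf_dir(".") and then look at the text of one file with NLP::content(docs[["abell.pdf"]]). Building the reader once and reusing it keeps the control options in a single place if you need to change them later.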

Upvotes: 2
