Reputation: 2696
I'm using the Mechanize
ruby gem to click a button on the web to download a PDF file and save it to the local file system.
URL = "www.my-site.com"
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::File # FYI I have also tried Mechanize::FileSaver and Mechanize::Download here
page = agent.get(URL)
form = page.forms.first
button = page.form.button_with(:value => "Some Button Text")
local_file = "path/to/file.pdf"
response = agent.submit(form, button)
response.save_as(local_file)
But when I try to read this PDF file using the PDF::Reader
gem, I get an error "PDF does not contain EOF marker".
reader = PDF::Reader.new(local_file) # this also happens if I try to use PDF::Reader.new(response.body) and PDF::Reader.new(response.body_io) depending on the different pluggable_parser configurations mentioned above
#> PDF::Reader::MalformedPDFError: PDF does not contain EOF marker
I'm able to save the PDF locally and view it and it looks fine, but the PDF::Reader
gem is complaining about it missing an EOF marker.
So my question is: is there a way I could add an EOF marker into the PDF or something to get around this error so I can parse the PDF?
Thanks.
Related (unanswered) question: PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) with pdf-reader
Related Docs:
EDIT:
I found the EOF marker somewhere in the middle of the downloaded file contents, followed by some HTML-looking stuff that I can't seem to figure out how to get rid of. I want to isolate the PDF content and then parse that, but still running into issues. Here is the full script I am using: https://gist.github.com/s2t2/c6766846d024edd696586b2bc7fee0bf
Upvotes: 0
Views: 2813
Reputation: 19221
The issue seems to be with the website you're accessing: http://employmentsummary.abaquestionnaire.org
The add HTML data at the end of the response.
However, you could truncate the response by searching for the first substring %EOF
and removing all the data after that.
i.e.:
pdf_data = result.body
pdf_data.slice!(0, pdf_data.index("%EOL").to_i + 4)
if(pdf_data.length <= 4)
# handle error
else
# save/send pdf_data
end
Upvotes: 0