s2t2

Reputation: 2696

Ruby - How to add EOF marker into a PDF file or otherwise bypass PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

I'm using the Mechanize ruby gem to click a button on the web to download a PDF file and save it to the local file system.

require "mechanize"

URL = "http://www.my-site.com" # placeholder; Mechanize needs an absolute URL
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::File # FYI I have also tried Mechanize::FileSaver and Mechanize::Download here

page = agent.get(URL)
form = page.forms.first
button = form.button_with(:value => "Some Button Text")

local_file = "path/to/file.pdf"
response = agent.submit(form, button)
response.save_as(local_file)

But when I try to read this PDF file using the PDF::Reader gem, I get an error "PDF does not contain EOF marker".

# This also happens with PDF::Reader.new(response.body) and PDF::Reader.new(response.body_io),
# depending on which of the pluggable_parser configurations mentioned above is in use.
reader = PDF::Reader.new(local_file)
#> PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

I'm able to save the PDF locally and view it and it looks fine, but the PDF::Reader gem is complaining about it missing an EOF marker.

So my question is: is there a way I could add an EOF marker into the PDF or something to get around this error so I can parse the PDF?

Thanks.

Related (unanswered) question: PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) with pdf-reader


EDIT:

I found the EOF marker somewhere in the middle of the downloaded file contents, followed by some HTML-looking content that I can't figure out how to get rid of. I want to isolate the PDF content and then parse that, but I'm still running into issues. Here is the full script I am using: https://gist.github.com/s2t2/c6766846d024edd696586b2bc7fee0bf
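To confirm what follows the marker before trying to strip it, a small diagnostic along these lines may help. This is only a sketch: `trailing_data_report` is a hypothetical helper name, not part of Mechanize or PDF::Reader, and the sample string stands in for `response.body`.

```ruby
# Hypothetical diagnostic helper: find the last "%%EOF" marker in the
# downloaded bytes and describe whatever comes after it.
def trailing_data_report(data, marker = "%%EOF")
  bytes = data.b                         # binary copy, so offsets are byte offsets
  pos = bytes.rindex(marker.b)           # position of the last marker, or nil
  return nil if pos.nil?

  trailing = bytes[(pos + marker.bytesize)..-1]
  {
    marker_offset: pos,                  # where the marker starts
    trailing_bytes: trailing.bytesize,   # how much data follows it
    trailing_preview: trailing[0, 80]    # first bytes of the extra data
  }
end

# Stand-in for response.body: a tiny fake PDF followed by HTML.
sample = "%PDF-1.4\n...objects...\n%%EOF\n<html><body>oops</body></html>"
report = trailing_data_report(sample)
puts report.inspect
```

A large `trailing_bytes` value with an HTML-looking preview would suggest that the server is appending a page to the PDF stream.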

Upvotes: 0

Views: 2813

Answers (1)

Myst

Reputation: 19221

The issue seems to be with the website you're accessing: http://employmentsummary.abaquestionnaire.org

They add HTML data at the end of the response.

However, you could truncate the response by searching for the first occurrence of the substring %%EOF and removing all the data after it.

i.e.:

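pdf_data = response.body
eof_index = pdf_data.index("%%EOF") # first EOF marker; nil if absent
if eof_index.nil?
   # handle error: no EOF marker found
else
   # keep everything up to and including the marker, dropping the trailing HTML
   pdf_data = pdf_data[0, eof_index + "%%EOF".length]
   # save/send pdf_data
end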
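A slightly more defensive variant of the same truncation idea is sketched below. It makes two assumptions that are mine, not the original answer's: it cuts at the last %%EOF rather than the first (a PDF saved with incremental updates can contain several markers), and it works on a binary copy of the data. `strip_trailing_garbage` is a hypothetical helper name, and the sample string stands in for `response.body`.

```ruby
require "stringio"

# Hypothetical helper: return the data up to and including the last
# "%%EOF" marker, discarding anything (such as appended HTML) after it.
def strip_trailing_garbage(raw, marker = "%%EOF")
  bytes = raw.b                                   # binary copy of the input
  last_marker = bytes.rindex(marker.b)
  raise ArgumentError, "no %%EOF marker found" if last_marker.nil?

  bytes[0, last_marker + marker.bytesize] + "\n"  # end the file with a newline
end

# Stand-in for response.body: a tiny fake PDF followed by HTML.
sample = "%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n<html>footer</html>"
clean = strip_trailing_garbage(sample)

# An IO-like wrapper around the cleaned bytes.
io = StringIO.new(clean)
puts clean.end_with?("%%EOF\n")
```

The cleaned bytes could then be written with `File.binwrite(local_file, clean)`, or the `StringIO` could be passed to `PDF::Reader.new`, which as far as I know accepts either a filename or an IO-like object.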

Upvotes: 0
