Matheus

Reputation: 33

How to test the content of a PDF file

I'm trying to open http://www.orimi.com/pdf-test.pdf and check that it contains the text "PDF Test File".

This is my code:

it 'pdf test' do
        visit 'http://www.orimi.com/pdf-test.pdf'
        puts page.title
        sleep 5
        convert_pdf_to_page
        expect(page).to have_content 'PDF Test File'
end

def convert_pdf_to_page
        temp_pdf = Tempfile.new('pdf')
        temp_pdf << page.source.force_encoding('UTF-8')
        reader = PDF::Reader.new(temp_pdf)
        pdf_text = reader.pages.map(&:text)
        temp_pdf.close
        page.driver.response.instance_variable_set('@body', pdf_text)
end

But I got:

PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

I searched and found that the problem might be the PDF file itself. I checked the temp_pdf variable, and it contains just HTML with an empty body.

Is there something wrong in my code?

Upvotes: 3

Views: 2233

Answers (1)

Greg

Reputation: 6648

PDF is a tricky format, and different readers react differently to unexpected content in a PDF file. Some crash; others make assumptions in order to keep going.

I'd guess that's what is happening here. When you open the file in a browser or PDF reader it works, but PDF::Reader can't handle whatever is non-standard in it.
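
It may also be worth ruling out the simpler case where the bytes handed to the parser are not a PDF at all; the question mentions that temp_pdf held only HTML. Here is a minimal sketch of such a check, using only the Ruby standard library. pdf_like? is a hypothetical helper (not part of PDF::Reader or Origami), and the 1024-byte windows are an assumption about where the %PDF- header and the %%EOF marker normally sit.

```ruby
# Rough sanity check: do these bytes look like a PDF at all?
# pdf_like? is a hypothetical helper, not an API of any PDF gem.
def pdf_like?(bytes)
  data = bytes.to_s.b                # binary copy, so encoding never gets in the way
  head = data[0, 1024] || ''         # the header is expected near the start
  tail = data[-1024, 1024] || data   # the EOF marker is expected near the end
  head.include?('%PDF-') && tail.include?('%%EOF')
end

html = '<html><head></head><body></body></html>'
pdf  = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\ntrailer\n<< >>\n%%EOF\n"

puts pdf_like?(html)  # false
puts pdf_like?(pdf)   # true
```

If this returns false for what you wrote to the tempfile, the "does not contain EOF marker" error is most likely about the input rather than about the parser.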

Try a different gem; Origami seems to be well regarded. I tried it with your file, and it seems to work:

> require 'origami'
> pdf = Origami::PDF.read '/tmp/pdf-test.pdf'
> pdf.grep(/Not existing/).any?
=> false
> pdf.grep(/PDF Test File/).any?
=> true

For reference (how I came up with this answer):

I googled the error message, PDF::Reader::MalformedPDFError: PDF does not contain EOF marker, and found this thread, which suggests that this is a fairly common problem with otherwise "working" PDFs. One of the last messages there recommends Origami, which (after checking) seems able to handle the PDF in question.
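
If you stay with the tempfile approach from the question, note that a PDF is binary data, so it is probably safer to write it in binary mode and to flush and rewind the handle before passing it to a parser. This is a rough sketch, assuming the raw PDF bytes have already been obtained somehow (for example by an HTTP download rather than from page.source). with_binary_tempfile is a made-up helper name.

```ruby
require 'tempfile'

# Write raw bytes to a tempfile in binary mode and yield a handle that is
# positioned at the start. with_binary_tempfile is a hypothetical helper.
def with_binary_tempfile(bytes)
  Tempfile.create(['download', '.pdf']) do |file|
    file.binmode          # no newline or encoding translation
    file.write(bytes.b)   # write the bytes exactly as received
    file.flush            # make sure everything is on disk
    file.rewind           # a reader should start at byte 0
    yield file
  end                     # the file is removed when the block exits
end

# Stand-in bytes for illustration only; this is not a complete PDF.
sample = "%PDF-1.4\n\xE2\xE3\xCF\xD3\n%%EOF\n".b
round_trip = with_binary_tempfile(sample) { |file| file.read }
puts round_trip == sample
```

Inside the block you would hand file (or file.path) to whichever PDF library you end up using, in place of the file.read call.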

Upvotes: 1
