Charlie

Reputation: 11767

Convert PDF to HTML in PHP?

I want to be able to convert a PDF file to an HTML file via PHP, but am running into some trouble.

I found a basic way to do this using Saaspose, which lets you convert PDFs to HTML files. There are some problems with its output, however, such as its use of SVGs, images, positioning, fonts, etc.

All I need is the ability to grab the text from the PDF file and any images associated with it, and then display it in a linear format rather than with absolute positioning.

What I mean by this is that if the PDF looks like this:

enter image description here

I'd want to convert it to a single-column HTML file. If there were images, I'd want them returned as well.

Is this possible in PHP? I know I can simply grab the text from the PDF file, but what about grabbing images as well?

Another problem is that I want everything to be inline, as it's being served to the client in a single file. Currently, I can do this with my setup through some code:

// Collect each <embed> inside an <object>, together with the SVG markup it points to.
for ($i = 0; $i < $object_number; $i++) {
    $object = $html->find("object")->find("embed")->eq($i);
    $embed = file_get_contents("Output/OutputHtml/" . $object->attr("src"));
    array_push($converted_obj, $embed);
    array_push($original_obj, $object);
}

// Replace each embed with the SVG markup itself so it is served inline.
for ($i = 0; $i < $object_number; $i++) {
    pq($original_obj[$i])->replaceWith($converted_obj[$i]);
}

This grabs all the SVG files and displays them inline. Images would be easier to handle, as I could use base64.
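For reference, the base64 approach mentioned above could look roughly like this. It is only a minimal sketch: the helper name image_to_data_uri and the extension-to-MIME map are assumptions for illustration, not part of any library.

```php
<?php
// Hypothetical helper: turn an image file into a data: URI that can be
// used directly as the src attribute of an inline <img> tag.
function image_to_data_uri(string $path): string
{
    $bytes = file_get_contents($path);
    if ($bytes === false) {
        throw new RuntimeException("Cannot read image: $path");
    }

    // Guess the MIME type from the file extension (assumed map; extend as needed).
    $types = [
        'png'  => 'image/png',
        'jpg'  => 'image/jpeg',
        'jpeg' => 'image/jpeg',
        'gif'  => 'image/gif',
    ];
    $ext  = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $mime = $types[$ext] ?? 'application/octet-stream';

    return 'data:' . $mime . ';base64,' . base64_encode($bytes);
}
```

The returned string can be written into the src attribute of an img element, so the image travels inside the single HTML file.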

Upvotes: 16

Views: 93688

Answers (4)

hindmost

Reputation: 7195

Cross-platform solution using Xpdf:

Download the appropriate Xpdf tools package and unpack it into a subdirectory of your script's directory. Let's assume it's called "xpdftools".

Add code like this to your PHP script:

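$pdf_file = 'sample.pdf';
$html_dir = 'htmldir';
// Quote both arguments so paths containing spaces or shell metacharacters are safe.
$cmd = 'xpdftools/bin32/pdftohtml ' . escapeshellarg($pdf_file) . ' ' . escapeshellarg($html_dir);

exec($cmd, $out, $ret);
echo "Exit code: $ret";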

After the script runs successfully, the htmldir directory will contain the converted HTML files (one file per page).
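Since each page ends up in its own file, a single-file result needs the pages gathered in order. A minimal sketch, assuming the converter writes one .html file per page into the output directory; the exact file names are not confirmed here, so the hypothetical helper collect_page_files simply picks up every .html file and sorts them naturally:

```php
<?php
// Hypothetical helper: list the per-page HTML files of an output
// directory in natural order, so that page10.html follows page2.html.
function collect_page_files(string $dir): array
{
    $files = glob(rtrim($dir, '/') . '/*.html');
    if ($files === false) {
        return [];
    }

    natsort($files);              // natural sort; keeps the original keys
    return array_values($files);  // reindex from 0
}
```

Each returned file could then be read with file_get_contents() and appended to one output document. Any index or frame file the converter also writes would be picked up too and may need filtering out.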

The Xpdf tools use the following exit codes:

  • 0 - No error.
  • 1 - Error opening a PDF file.
  • 2 - Error opening an output file.
  • 3 - Error related to PDF permissions.
  • 99 - Other error.
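The exit codes listed above can be turned into readable messages with a small lookup. This is a sketch only; the function name xpdf_exit_message is made up for illustration, and the messages are copied from the list:

```php
<?php
// Hypothetical helper: map an Xpdf tool exit code to the message
// from the list above; unknown codes get a generic fallback.
function xpdf_exit_message(int $code): string
{
    $messages = [
        0  => 'No error.',
        1  => 'Error opening a PDF file.',
        2  => 'Error opening an output file.',
        3  => 'Error related to PDF permissions.',
        99 => 'Other error.',
    ];

    return $messages[$code] ?? "Unknown exit code: $code";
}
```

With the script above, the value of $ret from exec() could be passed straight in, for example echo xpdf_exit_message($ret);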

Upvotes: 7

T.Todua

Reputation: 56341

1) Download the .exe file and unpack it to a folder: http://sourceforge.net/projects/pdftohtml/

2) Create a .php file in that folder with the following code (this assumes pdftohtml.exe and the source sample.pdf are both in that folder):

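<?php
$source_pdf = "sample.pdf";
$output_folder = "MyFolder";

if (!file_exists($output_folder)) {
    mkdir($output_folder, 0777, true);
}

// passthru() does not return the command's output; the exit code is stored in $exit_code.
passthru("pdftohtml " . escapeshellarg($source_pdf) . " " . escapeshellarg("$output_folder/new_file_name"), $exit_code);
var_dump($exit_code);
?>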

3) Open MyFolder and you will see the converted files (how many depends on the number of pages).

P.S. There are also many commercial and trial APIs for this.

Upvotes: 16

Heather McVay

Reputation: 949

What you want to achieve, judging by the graphic you posted, is actually OCR conversion of an image: http://www.phpclasses.org/package/2874-PHP-Recognize-text-objects-in-graphical-images.html

Upvotes: -1

user1444410

Reputation:

What you are essentially looking to do is reflow the PDF file. I'm not sure a tool for this exists, and it is at best very difficult to do.

It would be possible to write some code that does what you need for your specific file, but I believe doing so for the general case would be impossible.

I have written an article here that explains why I believe reflowing PDF is flawed: http://www.planetpdf.com/enterprise/article.asp?ContentID=PDF_Reflow_in_Microsoft_Word_2012_Is_it_any_good

Of particular interest is the paragraph beginning "Let's use a newspaper story to illustrate the problem."

You may want to look into what IDRsolutions (which, for transparency, is where I work!) has to offer.

We are currently in the process of putting our PDF to HTML5 and PDF conversion software in the cloud: http://www.idrsolutions.com/cloud-pdf-converter/

What may be a better fit for you is the PDF text extraction and PDF image extraction functionality of JPedal. It's quite likely we will look at putting this in the cloud as well if the PDF to HTML5 service goes well.

Text Extraction: http://www.idrsolutions.com/pdf-to-text-conversion/

Image Extraction: http://www.idrsolutions.com/extract-images-from-pdf/

Upvotes: 1
