Arunkumar S
Arunkumar S

Reputation: 112

Extract PDF file in java and render as HTML

How to extract PDF file content in java completely as Text and render as HTML?

Not like extracting just text separately or just images separately, requirement is to display contents of PDF file (as like original file-means including images and tables right at place where it was in original file) as HTML content.

Some how same like the sample in the answer here Convert Word to HTML with Apache POI which extracts contents of MS Doc file into HTML using Apache POI.

Upvotes: 1

Views: 1495

Answers (1)

Vel Genov
Vel Genov

Reputation: 11091

Extracting data from a PDF file is fairly simple. There are multiple libraries out there that do it correctly. Extracting data, and preserving its layout, on the other hand (the workflow the OP describes) is a very difficult process. The reason behind it is simple - most* PDF files, don't really have any elements that define structure. When a PDF file, for example, displays a table, it's very easy for humans to see it, and understand this is indeed a table with some data in it. However, in the PDF file itself, this is a collection of vector lines, and some text runs in between. The PDF itself, or the PDF viewer, are not aware that this is a table. Therefore when this data is converted to HTML, we don't know that we need to draw a table, but instead see this as vector art. This is just one example of why this is difficult. There are many others that can be used to illustrate this point.

On the other hand, such a thing exists as "Tagged PDF" (section 10.7). It's a PDF where structure elements are actually defined, and extraction is fairly easy. However tagged PDF files are not as common as we would like, and in most cases you won't be guaranteed to work with one.

There are some tools on the market that use sophisticated logic to infer the structure of an untagged document. Some of them do a better job than others at this. I've worked with Adobe Acrobat, which does a decent job at creating an HTML file. There is also an offering from Datalogics (I work for Datalogics) called PDF Alchemist which converts PDF to HTML. Both of them are commercial solutions.

If you are looking for a free solution, PDFBox does a good job at extracting content from a PDF document. However, it doesn't have the ability to create an HTML file, and this is something that will have to be implemented outside of the library. I'm not aware of any free PDF to HTML solutions that do a good enough job, and I would be willing to recommend.

Upvotes: 2

Related Questions