user1528782

Reputation: 21

Conversion of PDF to XML-XSL - Java?

How to convert a PDF to XML and capture its structure/styling in XSL?

Upvotes: 2

Views: 6058

Answers (3)

cemerick

Reputation: 5916

PDFTextStream can readily extract text from PDF documents as XML. One particular PDF-to-XML approach, XMLOutputTarget, ships with PDFTextStream along with its source, so you can easily tweak it to suit your requirements.

Code samples are available to get you started, or you can read in more depth about how PDF text extraction with PDFTextStream works.

(Disclosure: I am employed by Snowtide, the makers of PDFTextStream. I hope this pointer is helpful in any case.)

Upvotes: 2

Daniel Haley

Reputation: 52888

I think Michael Kay nailed it when he described PDF -> XML conversion as 'trying to convert hamburgers into cows'.

I've done quite a bit of PDF to XML conversion in the past. I was lucky in that the PDFs I had to convert were decent and didn't require OCR. Most of my issues were around tables and graphics. Converting to Word first, as Michael suggests, may help with those.

What I did was convert the PDF to text using pdftotext from Xpdf and then convert the text to XML. (I used Omnimark for the text -> XML conversion, but you could probably use Java or Python to do the conversion.) It might be easiest to convert to a basic structure first and then use XSLT (2.0!) to fine-tune it.
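As a rough illustration of the "basic structure" step in Java, here is a minimal sketch. It assumes the text has already been produced by pdftotext and that blank lines (or form feeds between pages) separate paragraphs. The class name `TextToXml` and the `document`/`para` element names are made up for this example; real documents will need smarter rules for headings, lists and tables.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: wraps blank-line-separated blocks of plain text in para elements.
class TextToXml {

    static String toXml(String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("document");
        doc.appendChild(root);

        // Assumption: a blank line or a form feed (page break) ends a paragraph.
        for (String block : text.split("\\r?\\n\\s*\\n|\\f")) {
            // Collapse the line breaks and runs of spaces inside one paragraph.
            String normalized = block.trim().replaceAll("\\s+", " ");
            if (normalized.isEmpty()) {
                continue;
            }
            Element para = doc.createElement("para");
            para.setTextContent(normalized); // the serializer escapes &, < and >
            root.appendChild(para);
        }

        // Serialize the DOM tree to a string without the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

The output of something like this is deliberately flat; the idea is that an XSLT pass can then promote paragraphs to headings, sections and so on.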

Upvotes: 0

Michael Kay

Reputation: 163645

I once described PDF-to-XML conversion as trying to convert hamburgers into cows. It's an exercise in reverse engineering. PDF is very variable in the way it represents text; in the worst case, all you have is a scanned image (in which case you are essentially doing OCR). If you're lucky, you have a collection of strings of text with the coordinates of where they appear on the page, but no other indication of structure.
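To give a feel for the "lucky" case, here is a minimal Java sketch of the first reverse-engineering step: turning positioned strings back into lines of text. The `Fragment` type and `LineGrouper` class are hypothetical and not taken from any PDF library, and the sketch assumes that y grows down the page and that fragments on the same line have y values within a small tolerance of each other.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: rebuild text lines from strings that only carry page coordinates.
class LineGrouper {

    // A positioned string, roughly what a PDF extractor might report (assumed shape).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        // Order fragments from the top of the page to the bottom.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble(f -> f.y));

        // Start a new line whenever y moves more than the tolerance away
        // from the first fragment of the current line.
        List<List<Fragment>> lines = new ArrayList<>();
        double lineY = 0;
        for (Fragment f : sorted) {
            if (lines.isEmpty() || Math.abs(f.y - lineY) > yTolerance) {
                lines.add(new ArrayList<>());
                lineY = f.y;
            }
            lines.get(lines.size() - 1).add(f);
        }

        // Within each line, read left to right and join with single spaces.
        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble(f -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

Even this only recovers lines; columns, tables, headings and reading order all need further guesswork, which is why the problem is harder than it first looks.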

There are tools that do a reasonable job (usually producing Microsoft Word) if the PDF is in a form that they understand. Google "PDF to Word conversion". Try them out (it's a while since I did so); don't try to write your own. From Word, of course, getting to XML is "relatively" straightforward.

Upvotes: 5
