Reputation: 21
How to convert a PDF to XML and capture its structure/styling in XSL?
Upvotes: 2
Views: 6058
Reputation: 5916
PDFTextStream can readily extract text from PDF documents as XML. One PDF-to-XML approach, XMLOutputTarget, ships with PDFTextStream, and its source is included so you can easily tweak it to suit your requirements.
Code samples are available to get you started, or you can read in more depth about how PDF text extraction with PDFTextStream works.
(Disclosure: I am employed by Snowtide, the makers of PDFTextStream. I hope this pointer is helpful in any case.)
Upvotes: 2
Reputation: 52888
I think Michael Kay nailed it when he described PDF -> XML conversion as 'trying to convert hamburgers into cows'.
I've done quite a bit of PDF to XML conversion in the past. I've been lucky in that I've got decent PDFs to convert that didn't require OCR. Most of my issues were around tables and graphics. Converting to Word first like Michael suggests may help with those.
What I did was convert the PDF to text using pdftotext from Xpdf and then convert the text to XML. (I used Omnimark for the text -> XML conversion, but you could probably use Java or Python to do the conversion.) It might be easiest to convert to a basic structure and then use XSLT (2.0!) to fine-tune it.
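For anyone wanting to try the pdftotext-then-XML route in Python, here is a minimal sketch. It assumes pdftotext is installed and on the PATH. The element names (document, page, para) and both function names are made up for illustration, and the blank-line paragraph heuristic is only a guess at a "basic structure" that will need adjusting per document.

```python
import re
import subprocess
import xml.etree.ElementTree as ET


def pdf_to_text(pdf_path: str) -> str:
    """Run pdftotext and return the extracted text.

    -layout keeps the physical layout, -enc UTF-8 sets the output
    encoding, and the final "-" sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_to_xml(text: str) -> ET.Element:
    """Wrap extracted text in a basic document/page/para structure."""
    root = ET.Element("document")
    # pdftotext separates pages with a form feed character.
    for number, page_text in enumerate(text.split("\f"), start=1):
        if not page_text.strip():
            continue  # skip empty pages, e.g. after a trailing form feed
        page = ET.SubElement(root, "page", number=str(number))
        # Heuristic (an assumption): blank lines mark paragraph breaks.
        for block in re.split(r"\n\s*\n", page_text.strip()):
            para = ET.SubElement(page, "para")
            # Collapse line breaks and runs of spaces inside a paragraph.
            para.text = " ".join(block.split())
    return root


if __name__ == "__main__":
    sample = "Title\n\nFirst line\nsecond line\fPage two\n"
    print(ET.tostring(text_to_xml(sample), encoding="unicode"))
```

With a real file you would call text_to_xml(pdf_to_text("input.pdf")), where input.pdf is a placeholder path. The flat XML that comes out is the sort of thing you can then refine with XSLT.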
Upvotes: 0
Reputation: 163645
I once described PDF-to-XML conversion as trying to convert hamburgers into cows. It's an exercise in reverse engineering. PDF is very variable in the way it represents text; in the worst case, all you have is a scanned image (in which case you are essentially doing OCR). If you're lucky, you have a collection of strings of text with the coordinates of where they appear on the page, but no other indication of structure.
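To make the "strings with coordinates" case concrete, here is a small Python sketch of the first reverse-engineering step: guessing which fragments belong to the same line. The Fragment type, the group_into_lines function and the 2-unit tolerance are all hypothetical, and it assumes y grows down the page. It illustrates the problem rather than giving a working converter.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of text and where it sits on the page (hypothetical type)."""
    x: float
    y: float
    text: str


def group_into_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group fragments with nearly equal y into lines, read left to right."""
    lines: List[List[Fragment]] = []
    # Assumption: y increases down the page, so sorting by y is top to bottom.
    for frag in sorted(fragments, key=lambda f: f.y):
        # Compare against the first fragment of the current line.
        if lines and abs(frag.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments by x and join them with spaces.
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


if __name__ == "__main__":
    page = [
        Fragment(100, 50.5, "World"),
        Fragment(10, 50, "Hello"),
        Fragment(10, 70, "Second"),
        Fragment(60, 70.4, "line"),
    ]
    print(group_into_lines(page))
```

Even this simple step depends on a guessed tolerance, and nothing in the coordinates tells you whether a line is a heading, a table cell or body text.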
There are tools that do a reasonable job (usually producing Microsoft Word) if the PDF is in a form that they understand. Google "PDF to Word conversion". Try them out (it's a while since I did so); don't try to write your own. From Word, of course, getting to XML is "relatively" straightforward.
Upvotes: 5