Reputation: 31
First some background: My site has two basic types of users. Users with free accounts can upload documents and paid customers can then search and view or download those documents. Uploaders can view only the documents they own while paid customers can view anything. Currently we only support Word documents (either .doc or .docx) and plain text. We use the JODConverter library to convert between Word and html; the html is what's stored in the database and what's displayed to users.
We want to move to accepting PDFs as well, but I'm not sure of the best way to go about either displaying the PDFs or converting them to html. I've seen suggestions to use Google Docs to do the conversion on the fly, but it doesn't seem feasible to restrict access properly given that the document has to be publicly accessible to Google - please correct me if I'm wrong. It seems like simply embedding the PDF with a tag in the html (or using something like PDFBox) would run into the same problem.
Alternatively, we could forget displaying the PDF files directly and convert them into html like we do with Word documents, but I've yet to come across a decent-looking library for that. Everything I've looked at so far reportedly doesn't do a great job of converting, is Windows-only and/or has a hefty licensing fee. (A licensing fee isn't necessarily a deal-breaker if it's not more than $100 / year or so.) Does anyone know of a good Java conversion library? (Something that runs via command-line would be acceptable if it actually does a good job.)
One last thing, we plan to offer the paid customers the option to download the original PDF files. Is that likely to be complicated? Is there anything I should be keeping in mind when building the rest of the process?
Upvotes: 3
Views: 116
Reputation: 2006
Instead of converting the PDF into HTML, which can mean some kind of OCR (recognizing the text) when the PDF has no extractable text, you can convert the PDF into images via tools like JPedal and create an HTML page which links to those images in sequential order. Since this is a Java library, it's not Windows-only.
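As a rough sketch of the second half of that idea, here is how the HTML page could be assembled once the page images already exist. It doesn't call JPedal at all. The class name `PdfPageGallery`, the `page-N.png` naming scheme and the `/docs/42/pages` path are all made up for illustration, and the image URLs would still need to go through whatever access check the rest of the site uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds an HTML page showing pre-rendered page images in order.
// Rendering the PDF to images (e.g. with JPedal) is assumed to have happened already.
final class PdfPageGallery {

    // Returns URLs page-1.png .. page-N.png under an assumed base path.
    static List<String> pageImageUrls(String basePath, int pageCount) {
        if (pageCount < 0) {
            throw new IllegalArgumentException("pageCount must be >= 0");
        }
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            urls.add(basePath + "/page-" + page + ".png");
        }
        return urls;
    }

    // Emits one <img> per page, in the order the URLs are given.
    static String buildHtml(String title, List<String> imageUrls) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html>\n<head><title>")
            .append(escape(title))
            .append("</title></head>\n<body>\n");
        int page = 1;
        for (String url : imageUrls) {
            html.append("<img src=\"").append(escape(url))
                .append("\" alt=\"Page ").append(page).append("\">\n");
            page++;
        }
        html.append("</body>\n</html>\n");
        return html.toString();
    }

    // Minimal escaping so titles and URLs cannot break out of the markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.print(buildHtml("Sample document", pageImageUrls("/docs/42/pages", 3)));
    }
}
```

One trade-off to keep in mind: a page of images is not searchable or selectable as text, so if paid customers search the document contents you would still need the text extracted separately.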
Downloading the original PDF files shouldn't be a problem. You just have to set the Content-Type header of the response to the standard PDF MIME type, application/pdf.
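To make that concrete, here is a minimal sketch of the headers a download response might carry, together with the access rules described in the question. The class and method names (`PdfDownload`, `canView`, `canDownloadOriginal`, `downloadHeaders`) are made up, and it returns a plain map rather than touching any servlet API, so you would copy the values onto whatever response object your framework provides. The `Content-Disposition: attachment` header is my addition; it asks the browser to save the file instead of displaying it inline.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; names and parameters are assumptions for illustration.
final class PdfDownload {

    // Viewing rule from the question: paid customers can view anything,
    // uploaders only the documents they own.
    static boolean canView(boolean isPaidCustomer, long userId, long ownerId) {
        return isPaidCustomer || userId == ownerId;
    }

    // The question only mentions offering original-file downloads to paid customers.
    static boolean canDownloadOriginal(boolean isPaidCustomer) {
        return isPaidCustomer;
    }

    // Builds the response headers for serving the original PDF as a download.
    static Map<String, String> downloadHeaders(String fileName, long sizeInBytes) {
        // Replace characters that could break the quoted filename or inject header lines.
        String safeName = fileName.replace('"', '_').replace('\\', '_')
                                  .replace('\r', '_').replace('\n', '_');
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Content-Type", "application/pdf");
        headers.put("Content-Disposition", "attachment; filename=\"" + safeName + "\"");
        headers.put("Content-Length", Long.toString(sizeInBytes));
        return headers;
    }

    public static void main(String[] args) {
        if (canDownloadOriginal(true)) {
            downloadHeaders("report.pdf", 1024L)
                .forEach((name, value) -> System.out.println(name + ": " + value));
        }
    }
}
```

The important part is that the access check runs on every request before any bytes are written, so the PDF is never reachable through a plain public URL.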
Upvotes: 1