JODConverter - PDF to HTML converting to garbage data

Question

Hi, I'm trying to use JODConverter 3.0 to convert PDF files to HTML. The resulting HTML file contains junk characters, so the conversion is clearly not successful. Can someone help me understand what's happening?

Here is the code snippet:

OfficeManager officeManager = new DefaultOfficeManagerConfiguration().buildOfficeManager();
officeManager.start();
try {
    OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
    converter.convert(inputFile, outputFile);
} finally {
    // Always shut the office process down, even if the conversion throws.
    officeManager.stop();
}

where inputFile and outputFile are File objects created with new File(...), pointing to "test.pdf" and "test.html" respectively.

Sample from output file:

%PDF-1.4 %Çì�¢ 5 0 obj <> stream
xœÅ][“#·q.[¢Ì,U’/’,Ë¦sìÄÉ9        ÏxpÇDOVh;NUª,{“<ˆ~X.wIÆ¼./²þF¬#œ##—Æ
13gIFÒ#8#h4€Æ×#4°O7}Çø¦wÿÇÂéÃ£_þÁlî>;zº‘\�#-ç#É†n#ôFIfÇZvsóñÑçG¾ùæ#¿
#ªZ³íó�ì˜Ô½†�#&–#µ½=Rê •ŸîöªS¦g#õ:åÉ•þ6WŒm7éÇŸ¥ÒÏ}        Æ¿ý»ÜàçéçÜÇÇD#3|æ5¡Jï¤G ›dÑQË?ÿ"0e¢pø©ú‡‘Anyñù#Y9H‡#&
…ÿü��½[[ôñÝDáÖ.Šƒ�‘¸•#w3¥##w[\KãwºÛÉ?sÓÀ¬ÑÃöŸÜ#A4´�Ýœ¾###ü<=#`#
À####IÍCùA(#]Ù×#Ë÷Žþ{óh%#Q¬K#A]°þ        À¶#L*##¥4¬ƒLü}þj�##á{SCê
‡¡Ã/"d½—`(# '`d»‡�0~       
ó3.#ï�ÏnÔ˜=Ì›ƒ(#Õ…)Ú½½ãÆtli##l#…9Úþrq#RöN<ð(®
£ž¯ïöCÇ•„ÙïÓˆ®_A#cî#Ÿ=_ät0®;Äé•d¤Á¶äÌ#p=�ÛÒ—Ã¶#»epe_g,#´-éiP=ìÃb#ð¸òb2î
—Ð©«(#Nõ=Úº—²‚% Ã#Ui×�AËÞ#s¶qý:Ã#xø

Olivier Masseau · Accepted Answer

You cannot convert a PDF to HTML or any other format with OpenOffice. It can convert TO PDF but not FROM PDF. What you get in the output is just the raw content of the PDF, the same as if you opened it in Notepad (note the %PDF-1.4 header at the start of your sample).
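
Since the output is just the untouched PDF bytes, one way to catch this early is to check whether a file starts with the PDF signature before (or after) handing it to the converter. The following is only a minimal sketch: the class name PdfSniffer and its methods are hypothetical and not part of JODConverter, and it only looks for "%PDF-" at offset 0, which is where a well-formed PDF places it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not part of JODConverter): detects the PDF signature.
class PdfSniffer {
    // Every well-formed PDF starts with the ASCII bytes "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given leading bytes begin with "%PDF-".
    static boolean looksLikePdf(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    // Reads only the first few bytes of the file and checks them.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[PDF_MAGIC.length];
            int off = 0;
            while (off < head.length) {
                int n = in.read(head, off, head.length - off);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                off += n;
            }
            return off == head.length && looksLikePdf(head);
        }
    }
}
```

For example, calling PdfSniffer.looksLikePdf(outputFile.toPath()) on the "test.html" from the question would return true, which tells you the file was passed through rather than converted.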

You could instead use the iText library to parse the PDF file and build the HTML from the extracted text. This can get tricky if you need to keep the original formatting.
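
To illustrate the second half of that approach, here is a rough sketch of turning already-extracted page text into a simple HTML document. It assumes you have obtained one String per page from a PDF parser (with iText 5 that would likely be PdfTextExtractor.getTextFromPage, but the extraction step is not shown here). The class PlainTextToHtml and its methods are hypothetical names chosen for this example, and the output deliberately ignores the original layout: each page just becomes a <pre> block.

```java
import java.util.List;

// Hypothetical helper: wraps extracted page text in a minimal HTML document.
class PlainTextToHtml {
    // Escapes the characters that have special meaning in HTML.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds one <pre> block per page so line breaks in the text survive.
    static String toHtml(String title, List<String> pages) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html>\n<head><meta charset=\"UTF-8\"><title>")
            .append(escape(title))
            .append("</title></head>\n<body>\n");
        for (String page : pages) {
            html.append("<pre>").append(escape(page)).append("</pre>\n");
        }
        html.append("</body>\n</html>\n");
        return html.toString();
    }
}
```

The resulting string can be written to "test.html" with UTF-8 encoding. Anything beyond plain text (fonts, columns, images) would need a tool that understands the PDF layout.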

You may also want to have a look at pdftohtml: http://sourceforge.net/projects/pdftohtml/
