Muhammad Imran Saeed

Reputation: 613

JODConverter - PDF to HTML converting to garbage data

Hi, I'm trying to use JODConverter 3.0 to convert PDF files to HTML. The resulting HTML file contains junk characters, so the conversion is not successful. Can someone help me understand what's happening?

Here is the code snippet:

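import org.artofsolving.jodconverter.OfficeDocumentConverter;
import org.artofsolving.jodconverter.office.DefaultOfficeManagerConfiguration;
import org.artofsolving.jodconverter.office.OfficeManager;

// Start a managed OpenOffice instance, run the conversion, then shut it down.
OfficeManager officeManager = new DefaultOfficeManagerConfiguration().buildOfficeManager();
officeManager.start();
OfficeDocumentConverter converter = new OfficeDocumentConverter(officeManager);
converter.convert(inputFile, outputFile);
officeManager.stop();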

where inputFile is "test.pdf" and outputFile is "test.html", each created with new File(...).

Sample from output file:

%PDF-1.4 %Çì�¢ 5 0 obj <</Length 6 0 R/Filter /FlateDecode>> stream
xœÅ][“#·q.[¢Ì,U’/’,˦sìÄÉ9        ÏxpÇDOVh;NUª,{“<ˆ~X.wIƼ./²þF¬#œ##—Æ
13gIFÒ#8#h4€Æ×#4°O7}Çø¦wÿÇÂéã_þÁlî>;zº‘\�#-ç#Ɇn#ôFIfÇZvsóñÑçG¾ùæ#¿
#ªZ³íó�ì˜Ô½†�#&–#µ½=Rê •ŸîöªS¦g#õ:åÉ•þ6WŒm7éÇŸ¥ÒÏ}        Æ¿ý»ÜàçéçÜÇÇD#3|æ5¡Jï¤G ›dÑQË?ÿ"0e¢pø©ú‡‘Anyñù#Y9H‡#&
…ÿü��½[[ôñÝDáÖ.Šƒ�‘¸•#w3¥##w[\KãwºÛÉ?sÓÀ¬ÑÃöŸÜ#A4´�Ýœ¾###ü<=#`#
À####IÍCùA(#­]Ù×#Ë÷Žþ{óh%#Q¬K#A]°þ        À¶#L*##¥4¬ƒLü}þj�##á{SCê
‡¡Ã/"d½—`(# '`d»‡�0~       
ó3.#ï�ÏnÔ˜=Ì›ƒ(#Õ…)Ú½½ãÆtli##l#…9Úþrq#RöN<ð(®
£ž¯ïöCÇ•„ÙïÓˆ®_A#cî#Ÿ=_ät0®;Äé•d¤Á¶äÌ#p=�Ûҗö#»epe_g,#´-éiP=ìÃb#ð¸òb2î
—Щ«­(#Nõ=Úº—²‚% Ã#Ui×�AËÞ#s¶qý:Ã#xø

Upvotes: 2

Views: 1281

Answers (1)

Olivier Masseau

Reputation: 854

You cannot convert PDF to HTML, or to any other format, with OpenOffice. It can convert TO PDF but not FROM it. What you are getting is just the raw content of the PDF, the same as if you opened the file in Notepad.
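If you want to detect this situation in code, one option is to check whether the "converted" file still begins with the PDF signature `%PDF-`, as your sample output does. This is only a minimal sketch: `PdfHeaderCheck` and its methods are hypothetical names, not part of JODConverter, and it assumes Java 11 or later for `InputStream.readNBytes`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper (not a JODConverter class): reports whether some data
// still looks like a raw PDF, i.e. it starts with the signature "%PDF-".
final class PdfHeaderCheck {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes begin with the PDF signature. */
    static boolean startsWithPdfHeader(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Reads only the first few bytes of the file and checks them. */
    static boolean startsWithPdfHeader(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes returns fewer bytes only if the file is shorter than requested.
            return startsWithPdfHeader(in.readNBytes(PDF_MAGIC.length));
        }
    }
}
```

Calling `PdfHeaderCheck.startsWithPdfHeader(outputFile.toPath())` after the conversion would then tell you that test.html is still a PDF rather than real HTML.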

You could instead use the iText library to parse the PDF file and create the HTML with the parsed text. It could be a bit tricky if you need to keep the original formatting.
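For the "create the HTML" half of that approach, here is a rough sketch using only the JDK. It assumes you have already extracted the text of each page into a `String` with a PDF library such as iText (that part is not shown). `TextToHtml`, `escape` and `toHtml` are hypothetical names made up for illustration, and the output is one `<pre>` block per page, so none of the original layout or fonts are preserved.

```java
import java.util.List;

// Hypothetical helper: wraps already-extracted page text in a minimal HTML document.
// The text extraction itself (e.g. with a PDF library) is assumed to happen elsewhere.
final class TextToHtml {

    /** Escapes the characters that have special meaning in HTML. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds an HTML page with one <pre> element per extracted PDF page. */
    static String toHtml(String title, List<String> pageTexts) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html><head><meta charset=\"UTF-8\"><title>")
            .append(escape(title))
            .append("</title></head>\n<body>\n");
        for (String page : pageTexts) {
            html.append("<pre>").append(escape(page)).append("</pre>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }
}
```

You would write the returned string to test.html yourself, for example with `Files.write` using UTF-8 to match the declared charset.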

You could also have a look at pdftohtml: http://sourceforge.net/projects/pdftohtml/

Upvotes: 3
