Deepak Talape

Reputation: 997

Tesseract OCR is not working properly after integrating with Alfresco 5.0.d

I have integrated Tesseract OCR with Alfresco 5.0.d. My requirement is to convert the contents of PDF files to text.

This works fine for small files.

But when I upload larger files, say more than 50 MB, I get the exception below and the PDF is not fully converted to text. Only the first few pages are converted.

Please refer to the logs below:

java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)

Has anyone faced the same issue? Please help.

Thanks in advance.

Upvotes: 1

Views: 700

Answers (2)

Vikash Patel

Reputation: 1346

You may have to increase the maximum source size for the PDF-to-text content transformation in the alfresco-global.properties file.

You can set the size limit for a transformation using these properties.

If you are using OOoDirect:

content.transformer.complex.OpenOffice.Pdf2swf.extensions.doc.swf.maxSourceSizeKBytes=5120
content.transformer.complex.OpenOffice.Pdf2swf.extensions.docx.swf.maxSourceSizeKBytes=5120

If you are using OOoJodConverter:

content.transformer.complex.JodConverter.Pdf2swf.extensions.doc.swf.maxSourceSizeKBytes=5120
content.transformer.complex.OpenOffice.Pdf2swf.extensions.docx.swf.maxSourceSizeKBytes=5120

For more detail, refer to this community question and the related pages below: https://community.alfresco.com/thread/211670-changing-transformation-limits-version-5b

https://community.alfresco.com/thread/203406-how-to-config-alfresco-documents-preview-size-limit-on-42d

https://injustfiveminutes.wordpress.com/2012/11/28/docx-pptx-document-preview-fails-on-alfresco-4-2-c/

Upvotes: 2

Ben Chevallereau

Reputation: 439

I'm a bit surprised. Alfresco already includes PDFBox, which handles the PDF --> TXT conversion, so you shouldn't need Tesseract for that. Your stack trace also looks a bit odd. To see what's going on with the transformers, set log4j.logger.org.alfresco.repo.content.transform.TransformerDebug and log4j.logger.org.alfresco.repo.content.transform to DEBUG.
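
As a rough sketch, the two logger settings mentioned above could look like this in a log4j properties file. Where the file lives is an assumption here (for example a custom-log4j.properties on the extension classpath), so adjust it for your installation, and restart Alfresco if the change is not picked up:

```properties
# Assumed location: a custom log4j properties file picked up by Alfresco.
# Both logger names are the ones given in this answer.
log4j.logger.org.alfresco.repo.content.transform.TransformerDebug=debug
log4j.logger.org.alfresco.repo.content.transform=debug
```

With these enabled, the log should show which transformer is selected for the PDF and where it fails or times out.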

Upvotes: 2
