user8862290

Reputation: 81

How to embed OCR text from TIFF images into PDF documents using Java and OpenPDF?

I have scanned paper documents and OCR'ed them into text, and I am now building PDF documents from groups of the TIFF images using Java and OpenPDF. How can I embed the OCR text behind each TIFF image so that the PDF displays only the images but is still text-searchable?

More than 10 years ago I had code that did this, but it was removed long ago and no version of it remains. The internet seems to be forgetting these things now: I have searched high and low and cannot find an example of how to do it.

Below is what I am currently using to build the PDF. I retrieve the raw data for each TIFF image and write it into the document through PdfWriter. The problem is that the text is neither behind the image nor hidden: it ends up on a separate page altogether, and I cannot find an API for specifying where the text is placed.

    // step 1: creation of a document-object
    Document document = new Document( );
    document.setMargins( 0, 0, 0, 0 );
    // create a PDF-stream object.
    PdfWriter.getInstance( document, bout );
    document.open( );
    for( int x = 0; x < parts.length; x++ )
    {
        ImageData data = getImageData( db, parts[x].getImageId( ), ImageData.TIFF_FORMAT, usesClob );
        ImageInputStream iis = ImageIO.createImageInputStream( data.toStream( true ) );
        ImageReader reader = getTiffImageReader();
        reader.setInput(iis);
        int pages = reader.getNumImages(true);
        for (int imageIndex = 0; imageIndex < pages; imageIndex++)
        {
            BufferedImage bufferedImage = reader.read(imageIndex);
            Image image = Image.getInstance(bufferedImage, null, false);
            image.scaleToFit( document.getPageSize( ).getWidth( ),
                              document.getPageSize( ).getHeight( ) );
            document.add(image);
            // Add the text behind the image.
            String pageId = parts[x].getId( );
            ArrayList<ImageDataText> blocks = getPageImageDataText( db, pageId );
            for( ImageDataText block : blocks )
            {
                Paragraph p = new Paragraph();
                for( LineText text : block.getLineText( ) )
                {
                    p.add( new Chunk( text.getLine( ) ) );
                }
                document.add( p );
            }
            // new page.
            document.newPage();
        }
    }
    document.close( );
    bout.writeTo( out );
    out.flush( );
    out.close( );
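
For what it is worth, the direction I have been considering is to stop adding the text as Paragraph objects and instead draw each OCR line at an absolute position with PdfContentByte, using the invisible text rendering mode. That needs each line's position converted from image pixels to page coordinates. Below is a minimal sketch of just that conversion, not a working solution. It assumes the OCR data carries a pixel bounding box per line (nothing in my code above exposes one, so the box values are hypothetical), that the image is placed with its bottom-left corner at the page origin, and that one image pixel equals one PDF point before scaling. The class and method names are mine, not OpenPDF's.

```java
// Sketch only: maps a hypothetical OCR line box (image pixels, top-left origin)
// to PDF user space (points, bottom-left origin). No OpenPDF dependency here.
final class OcrTextPlacement {

    // Uniform scale that fits the image inside the page while keeping its
    // aspect ratio, which is the effect scaleToFit is expected to have.
    static float fitScale(float imgW, float imgH, float pageW, float pageH) {
        return Math.min(pageW / imgW, pageH / imgH);
    }

    // Left edge of the line in page coordinates (image anchored at x = 0).
    static float pdfX(float pixelLeft, float scale) {
        return pixelLeft * scale;
    }

    // Bottom edge of the line in page coordinates. Pixel rows count downward
    // from the top of the image, PDF y counts upward from the bottom.
    static float pdfY(float pixelBottom, float imgH, float scale) {
        return (imgH - pixelBottom) * scale;
    }

    // Font size that roughly matches the height of the scanned line.
    static float fontSize(float pixelBoxHeight, float scale) {
        return pixelBoxHeight * scale;
    }

    public static void main(String[] args) {
        // Hypothetical 1190 x 1684 pixel scan placed on a 595 x 842 point page.
        float scale = fitScale(1190f, 1684f, 595f, 842f);
        float x = pdfX(100f, scale);
        float y = pdfY(200f, 1684f, scale);
        float size = fontSize(40f, scale);
        System.out.println("x=" + x + " y=" + y + " size=" + size);

        // With OpenPDF, these values would presumably be used like this
        // (untested here, "writer" is the instance returned by PdfWriter.getInstance):
        //   image.setAbsolutePosition(0, 0);
        //   PdfContentByte cb = writer.getDirectContentUnder();
        //   cb.beginText();
        //   cb.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
        //   cb.setFontAndSize(baseFont, size);
        //   cb.setTextMatrix(x, y);
        //   cb.showText(line);
        //   cb.endText();
    }
}
```

If this is the right idea, the Paragraph loop above would go away entirely, since invisible text drawn directly onto the page should stay on the same page as the image instead of flowing onto the next one.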

Upvotes: 2

Views: 185

Answers (0)
