I have scanned paper documents and OCR'ed them into text, and I am building PDF documents from groups of the TIFF images using Java and OpenPDF. How can I embed the OCR text behind each TIFF image so that the PDF displays only the images but is still text-searchable?

More than 10 years ago I had code that did this, but I removed it long ago and no copy remains. The internet seems to be forgetting these things now: I have searched high and low and cannot find an example.

Here is what I currently use to build the PDF. I read the raw data from each TIFF image and write it into the PdfWriter document. The problem is that the text is neither behind the image nor hidden. It ends up on a separate page altogether, and I cannot find an API for specifying where the text is placed.
    // Step 1: create the document object with no margins.
    Document document = new Document( );
    document.setMargins( 0, 0, 0, 0 );
    // Create the PDF writer on the output buffer.
    PdfWriter.getInstance( document, bout );
    document.open( );
    for( int x = 0; x < parts.length; x++ )
    {
        ImageData data = getImageData( db, parts[x].getImageId( ), ImageData.TIFF_FORMAT, usesClob );
        ImageInputStream iis = ImageIO.createImageInputStream( data.toStream( true ) );
        ImageReader reader = getTiffImageReader();
        reader.setInput(iis);
        int pages = reader.getNumImages(true);
        for (int imageIndex = 0; imageIndex < pages; imageIndex++)
        {
            BufferedImage bufferedImage = reader.read(imageIndex);
            Image image = Image.getInstance(bufferedImage, null, false);
            image.scaleToFit( document.getPageSize( ).getWidth( ),
                              document.getPageSize( ).getHeight( ) );
            document.add(image);
            // Intended: add the text behind the image.
            // Actual: the paragraphs flow onto a separate page.
            String pageId = parts[x].getId( );
            ArrayList<ImageDataText> blocks = getPageImageDataText( db, pageId );
            for( ImageDataText block : blocks )
            {
                Paragraph p = new Paragraph();
                for( LineText text : block.getLineText( ) )
                {
                    p.add( new Chunk( text.getLine( ) ) );
                }
                document.add( p );
            }
            // Start a new page for the next image.
            document.newPage();
        }
    }
    document.close( );
    bout.writeTo( out );
    out.flush( );
    out.close( );
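For context, one commonly used approach to this kind of searchable image PDF is to stop adding the text as Paragraph objects (which go through the normal page flow) and instead draw each OCR line directly on the page canvas at an absolute position, in invisible text rendering mode. In OpenPDF that would likely involve `PdfWriter.getDirectContent()`, `PdfContentByte.setTextRenderingMode( PdfContentByte.TEXT_RENDER_MODE_INVISIBLE )`, `setTextMatrix( x, y )` and `showText( ... )`, with the image pinned via `Image.setAbsolutePosition( 0, 0 )`; these names should be checked against the OpenPDF version in use. The sketch below covers only the coordinate arithmetic that such drawing needs. It assumes the OCR engine reports a pixel bounding box per line with a top-left origin, which the code above does not show, and the class and method names (`OcrTextPlacement`, `fitScale`, `toPdf`) are hypothetical.

```java
// Hypothetical helper: maps OCR pixel boxes to PDF user-space coordinates.
// Assumptions (not from the original code): the OCR box is given in image
// pixels with the origin at the top-left, and the scaled image is placed
// with its bottom-left corner at the page origin (0, 0).
class OcrTextPlacement {

    // Uniform scale that fits the image inside the page while keeping the
    // aspect ratio, which is the same idea as Image.scaleToFit.
    static double fitScale(double imgW, double imgH, double pageW, double pageH) {
        return Math.min(pageW / imgW, pageH / imgH);
    }

    // Returns { x, y, fontSize } in PDF points for one OCR line.
    // PDF user space has its origin at the bottom-left, so the vertical
    // axis is flipped: the bottom edge of the pixel box becomes the y value.
    // Using the box bottom as the baseline is an approximation.
    static double[] toPdf(double pxLeft, double pxTop, double pxHeight,
                          double imgH, double scale) {
        double x = pxLeft * scale;
        double y = (imgH - (pxTop + pxHeight)) * scale;
        double fontSize = pxHeight * scale;
        return new double[] { x, y, fontSize };
    }

    public static void main(String[] args) {
        // Example: a 2550x3300 pixel scan on a 612x792 point page.
        double scale = fitScale(2550, 3300, 612, 792);
        double[] pos = toPdf(300, 150, 50, 3300, scale);
        System.out.printf("x=%.1f y=%.1f size=%.1f%n", pos[0], pos[1], pos[2]);
        // With OpenPDF, these values would then feed something like:
        //   cb.beginText();
        //   cb.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_INVISIBLE);
        //   cb.setFontAndSize(baseFont, (float) pos[2]);
        //   cb.setTextMatrix((float) pos[0], (float) pos[1]);
        //   cb.showText(lineText);
        //   cb.endText();
    }
}
```

Because invisible text is never painted, it should not matter whether it is drawn on the layer above or below the image; what matters for search and text selection is that each line sits roughly over the matching pixels, which is why the scale factor and the flipped y axis are computed from the same image dimensions that the image scaling uses.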
Upvotes: 2
Views: 185