Extracting text represented as an image inside a PDF (iTextSharp)

Question

I am extracting text from a PDF file using iTextSharp. I have successfully extracted part of the text I was interested in, but as I carried on I noticed that some words (which I could not get as text when extracting the whole text of a page with iTextSharp) are actually represented as images. Adobe Reader confirmed this. So, in short: how can I extract the text contained in a PDF image object? Do I have to extract the image and find another way to convert it to text? This is a really unfortunate alignment of the planets for me. Has anyone else had this problem?

neminem · Accepted Answer

I would say yes, you will have to find another way. If the "text" in a PDF isn't actually in the text layer at all, but is only an image that depicts some text, you have to extract the images and then run OCR (optical character recognition, the term for generating text from images) on them. iTextSharp is not an OCR engine. (Some free OCR engines do exist, though, if you look.)
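
To illustrate the "extract the images, then OCR them" route, here is a minimal sketch, assuming iTextSharp 5.x and its parser namespace. ImageCollector and PdfImageText are names made up for this example, and the ocr callback is hypothetical: it stands in for whatever OCR engine you pick, since iTextSharp does not provide one. Images stored in formats that iTextSharp cannot decode may throw, so treat this as a starting point rather than a finished solution.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper: collects the encoded bytes of every image drawn on a page.
class ImageCollector : IRenderListener
{
    public readonly List<byte[]> Images = new List<byte[]>();

    // Text callbacks are ignored here; only images are of interest.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image != null)
        {
            Images.Add(image.GetImageAsBytes());
        }
    }
}

static class PdfImageText
{
    // 'ocr' is an assumption: a caller-supplied function that turns image bytes
    // into text using an OCR engine of your choice. iTextSharp has no OCR.
    public static List<string> ReadImageText(string pdfPath, Func<byte[], string> ocr)
    {
        var results = new List<string>();
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            // PDF page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                ImageCollector collector = parser.ProcessContent(page, new ImageCollector());
                foreach (byte[] bytes in collector.Images)
                {
                    results.Add(ocr(bytes));
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return results;
    }
}
```

You would call it as PdfImageText.ReadImageText("input.pdf", bytes => MyOcr(bytes)), where MyOcr is a placeholder for a wrapper around your chosen OCR library. Passing the OCR step in as a callback keeps the PDF handling independent of the engine, so you can try different engines without touching the extraction code.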
