How to use PDFTextExtractor on iTextSharp

Question

I want to retrieve the text from a pdf file using iTextSharp. However, I wasn't able to use PDFTextExtractor as in JAVA library of itextsharp(itext). I need readPDFOffline class to return content of file. I will give the pseudo below for you to understand well what I want.

private string readPDFOffline(string fileUri);
read PDF;
retrieve Text Content of This Pdf;*
save content into string contentOfflineFile;
return contentOfflineFile;

I would like to do the * part of Code

Mark Storer · Accepted Answer

PdfTextExtractor is present in the most recent releases of iTextSharp, available here.

Retrieving text in PDF is not easy. Not impossible, but there are times when the only thing that will work is OCR. For all other cases, PdfTextExtractor should work. Cases of it not working are considered bugs and should be reported as such.

Be aware that there are several cases where what looks like valid text is not extractable:

Text with no encoding... just glyph indexes. OCR time.
"Text" that is just raw paths. Horribly inefficient, and time for more OCR.
"Text" that is pixels in a bitmap. OCR once more.

OCR: Optical Character Recognition. There's even a reasonably good one for free available on Google Code, though I don't recall the name off the top of my head.

How to use PDFTextExtractor on iTextSharp

Answers (1)

Related Questions