Cannot read text from pdf by ITextSharp in C#

Question

I have two PDF files. On the Security tab, both files show Security Method: No Security, Document Assembly: Not Allowed, and Page Extraction: Not Allowed. All other items are allowed. I am using the standard iTextSharp method to retrieve text from a PDF:

StringBuilder text = new StringBuilder();
PdfReader pdfReader = new PdfReader(fileName);

for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy(); //LocationTextExtractionStrategy();
    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

    // re-encode the extracted string from the system default code page to UTF-8
    currentText = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
    text.Append(currentText);
}

From the first file I can get currentText without any problem, but from the second file I cannot retrieve any text: currentText is empty. I tried LocationTextExtractionStrategy, but the result is the same. I also opened the second file in SodaPDF and converted it to a txt file, but that file is empty too (while the first file converts to txt without any problems). Is it possible to read text from this file in C# or with any other application? If I buy Adobe Reader, will I be able to convert this file to txt? What is the difference between these two files?

Thanks

Nausherwan Aslam · Accepted Answer

I work as Social Media Developer at Aspose. I would suggest that you download and try Aspose.Pdf for .NET to convert the PDF to a text file. If your file contains images and you need to extract the text from those images, you can use Aspose.Pdf to convert the PDF file to images and then perform OCR using Aspose.OCR for .NET.

The following sample code converts a PDF to text using Aspose.Pdf for .NET:

//open document
Document pdfDocument = new Document("input.pdf");
//create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
//accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);
//get the extracted text
string extractedText = textAbsorber.Text;
// create a writer and open the file; "using" closes it even if writing fails
using (TextWriter tw = new StreamWriter("extracted-text.txt"))
{
    // write the extracted text to the file
    tw.WriteLine(extractedText);
}

Please download a free trial and try it.
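
As a possible way to check whether the second file falls into the "contains images" case mentioned above, the following is a minimal sketch using iTextSharp 5.x. It counts the fonts and image XObjects listed in each page's resources. The class name PdfPageInspector and the default path "input.pdf" are made up for illustration, and the sketch assumes each page dictionary exposes its resources directly; it does not look inside nested form XObjects.

```csharp
using System;
using iTextSharp.text.pdf;

static class PdfPageInspector // hypothetical helper name, not part of iTextSharp
{
    // Prints, for each page, how many fonts and image XObjects its resources list.
    static void Main(string[] args)
    {
        // "input.pdf" is a placeholder path used when no argument is given
        string fileName = args.Length > 0 ? args[0] : "input.pdf";
        PdfReader reader = new PdfReader(fileName);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                int fontCount = 0;
                int imageCount = 0;

                PdfDictionary pageDict = reader.GetPageN(page);
                PdfDictionary resources = pageDict.GetAsDict(PdfName.RESOURCES);
                if (resources != null)
                {
                    // fonts referenced by the page (needed for any real text)
                    PdfDictionary fonts = resources.GetAsDict(PdfName.FONT);
                    if (fonts != null) fontCount = fonts.Size;

                    // XObjects may be images or forms; count only the images
                    PdfDictionary xobjects = resources.GetAsDict(PdfName.XOBJECT);
                    if (xobjects != null)
                    {
                        foreach (PdfName name in xobjects.Keys)
                        {
                            PdfStream stream = xobjects.GetAsStream(name);
                            if (stream != null &&
                                PdfName.IMAGE.Equals(stream.GetAsName(PdfName.SUBTYPE)))
                            {
                                imageCount++;
                            }
                        }
                    }
                }

                Console.WriteLine("Page {0}: fonts={1}, images={2}",
                                  page, fontCount, imageCount);
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

If a page reports no fonts but one or more images, it is probably a scanned page, and text can only be recovered through OCR. If fonts are present and extraction still returns nothing, the cause may lie elsewhere (for example, fonts without a usable character mapping), which this sketch does not diagnose.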
