Reputation: 2701
I have two pdf files. On Sercurity tab both files have set Security Method: No Security and Document Assembly: Not Allowed and page Extraction: Not Allowed. Other items are allowed. I using standart ITextSharp method to retrieve text from pdf:
PdfReader pdfReader = new PdfReader(fileName);
for (int page = 1; page <= pdfReader.NumberOfPages; page++)
{
ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy(); //LocationTextExtractionStrategy();
string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
text.Append(currentText);
From first file i can get currentText wihtout any problem from second file I cannot retrieve text, currentText is empty. I was trying with LocationTextExtractionStrategy, but result is the same. I opened this file in SodaPDF and convert it to txt file but this file is empty too (while frist file is converted to txt without any problems). It is possible to read text from this file from C# or with any other application? If I buy Adobe Reader I will convert this file to txt ? What is difference between these two files ?
Thanks
Upvotes: 0
Views: 3977
Reputation: 458
I work as Social Media Developer at Aspose. I would suggest you to download and try Aspose.Pdf for .NET to convert PDF to Text file. In case your file contains images and you need to extract the text from those images, you can use Aspose.Pdf to convert Pdf file to images and then perform OCR using Aspose.OCR for .NET.
Following is the sample code to convert PDf to Text using Aspose.Pdf for .NET
//open document
Document pdfDocument = new Document("input.pdf");
//create TextAbsorber object to extract text
TextAbsorber textAbsorber = new TextAbsorber();
//accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);
//get the extracted text
string extractedText = textAbsorber.Text;
// create a writer and open the file
TextWriter tw = new StreamWriter("extracted-text.txt");
// write a line of text to the file
tw.WriteLine(extractedText);
// close the stream
tw.Close();
Please download a free trial and try it.
Upvotes: 0
Reputation: 1325
There may be a lot of pdf which actually are images. You cannot extracttext from imaged pdf as Bruno Lowagie said. you need to go for third party OCR for this.
you ca use Adobe Acrobat to convert the pdf to editable format like word, html..
Upvotes: 1