Reputation: 903
I'm looking for a way to extract text from a PDF and use it in a program. I've done some research on the net and got a few libraries working. These were not freeware, however, and I bumped into their limits.
So I'm looking for a free library. I thought of iTextSharp, but I have no idea how to get started. Can you guys help me out here?
Upvotes: 3
Views: 9398
Reputation: 837
Something like this should work for you. You have to watch it, though: function names change between iTextSharp releases, which is a bit annoying.
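using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Returns the extracted text of every page in the PDF at pdfPath.
public static string GetPDFText(string pdfPath)
{
    PdfReader reader = new PdfReader(pdfPath);
    StringWriter output = new StringWriter();
    // PDF page numbers start at 1, not 0.
    for (int i = 1; i <= reader.NumberOfPages; i++)
        output.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, new SimpleTextExtractionStrategy()));
    reader.Close(); // release the underlying file
    return output.ToString();
}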
Upvotes: 3
Reputation: 55427
iTextSharp is open source, but the licensing model changed after version 4.1.6. The old license (MPL/LGPL) was much less strict, while the new one (AGPL) requires payment if you use it commercially and don't want to release your source code. This may or may not affect you.
Here's the most basic version of text extraction using the 5.1.2.0 version:
//Full path to the file to read
string fileToRead = System.IO.Path.Combine(Environment.GetFolderPath(Environment.SpecialFolder.Desktop), @"file1.pdf");
//Bind a PdfReader to our file
iTextSharp.text.pdf.PdfReader reader = new iTextSharp.text.pdf.PdfReader(fileToRead);
//Extract all of the text from the first page
string allPage1Text = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, 1);
//That's it!
Console.Write(allPage1Text);
Upvotes: 0