derBasti

Reputation: 325

Extract text from PDF by format

I am trying to extract the headlines from PDF files. So far I have tried two approaches: reading the plain text and taking the first line (which failed because the headline is not always the first line of the plain text), and reading the text from a fixed region (which failed because the region is not the same in every document).

In my opinion, the easiest approach would be to read only the text that has a particular format (font, font size, etc.). Is there a way to do this?

Upvotes: 0

Views: 5084

Answers (1)

Bobrovsky

Reputation: 14246

You can enumerate all text objects on a PDF page using the Docotic.Pdf library. For each text object, the font and the size are available. Here is a sample:

using System;
using BitMiracle.Docotic.Pdf;

public static void listTextObjects(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        // {0} = text, {1} = font name, {2} = height, {3} = position on the page
        string format = "{0}\n{1}, {2}px at {3}";

        foreach (PdfPage page in pdf.Pages)
        {
            foreach (PdfPageObject obj in page.GetObjects())
            {
                // Skip images, paths and other non-text objects
                if (obj.Type != PdfPageObjectType.Text)
                    continue;

                PdfTextData text = (PdfTextData)obj;

                string message = string.Format(format, text.Text, text.Font.Name,
                    text.Size.Height, text.Position);
                Console.WriteLine(message);
            }
        }
    }
}

The code will output lines like the following for each text object on each page of the input PDF file.

FACTUUR
Helvetica-BoldOblique, 19.04px at { X=51.12; Y=45.54 }

You can use the retrieved information to find the largest text, bold text, or text with whatever other properties are used to format the headline.
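
One way to act on that information is to collect the text, font name and height of each text object and then pick the tallest one, preferring a bold font when heights are equal. The sketch below is only an illustration and is not part of Docotic.Pdf: the `TextRun` record and the `HeadlinePicker` class are hypothetical names, and treating any font whose name contains "Bold" as bold is an assumption that will not hold for every PDF. It also assumes C# 9 or later for the record syntax.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for the values printed by the sample above
// (text, font name, height). It is not a Docotic.Pdf type.
public record TextRun(string Text, string FontName, double Height);

public static class HeadlinePicker
{
    // Assumption: bold fonts carry "Bold" in their name,
    // e.g. "Helvetica-BoldOblique".
    public static bool LooksBold(TextRun run) =>
        run.FontName.IndexOf("Bold", StringComparison.OrdinalIgnoreCase) >= 0;

    // Returns the tallest non-blank run; when heights are equal a bold run wins.
    // Returns null when there is no usable text.
    public static TextRun? PickHeadline(IEnumerable<TextRun> runs) =>
        runs.Where(r => !string.IsNullOrWhiteSpace(r.Text))
            .OrderByDescending(r => r.Height)
            .ThenByDescending(r => LooksBold(r))
            .FirstOrDefault();
}
```

To feed it, the loop in the sample could add `new TextRun(text.Text, text.Font.Name, text.Size.Height)` to a list for each page and pass that list to `HeadlinePicker.PickHeadline`.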

If your PDF is guaranteed to have the headline as the topmost text on a page, then you can use an even simpler approach:

public static void printText(string inputPdf)
{
    using (PdfDocument pdf = new PdfDocument(inputPdf))
    {
        foreach (PdfPage page in pdf.Pages)
        {
            // Text of the whole page, in reading order
            string text = page.GetTextWithFormatting();
            Console.WriteLine(text);
        }
    }
}

The GetTextWithFormatting method returns text in reading order (i.e., from the top left to the bottom right).
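
If the headline really is the topmost text, it should then be the first non-blank line of the returned string. A minimal sketch of that last step, assuming the returned text uses ordinary line breaks; `HeadlineFromText` and `FirstNonEmptyLine` are hypothetical names, not library members:

```csharp
using System;
using System.Linq;

public static class HeadlineFromText
{
    // Returns the first line that contains something other than whitespace,
    // trimmed, or an empty string when the page text has no such line.
    public static string FirstNonEmptyLine(string pageText)
    {
        if (string.IsNullOrEmpty(pageText))
            return string.Empty;

        return pageText
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None)
            .Select(line => line.Trim())
            .FirstOrDefault(line => line.Length > 0) ?? string.Empty;
    }
}
```

Inside the loop above this would be called as `HeadlineFromText.FirstNonEmptyLine(page.GetTextWithFormatting())`.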

Disclaimer: I am one of the developers of the library.

Upvotes: 3
