marsze

Reputation: 17144

Get bounds of glyphs in PDF with GemBox

Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:

Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.

Consider this common example of a page header:

Billing Info                        Date:   02/02/2022

Company Ltd.                Order Number:    0123456789
123 Main Street                     Name:   Smith, John

Let's say I want to get the order number (0123456789) from the document, knowing its precise position on the page. In practice, though, the entire line is often one single text element with the content "Company Ltd.Order Number:0123456789", and all positioning and spacing is done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph, so I can combine them into "words" (= character sequences separated by whitespace or large offsets).

I know this is definitely possible in other libraries, but this question is specific to GemBox. It seems to me that all the necessary implementations should already be there; just not much of it is exposed in the API.

In iTextSharp, I can get the bounds for each single glyph like this:

// itextsharp 5.2.1.0
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
    public override void RenderText(TextRenderInfo renderInfo)
    {
        var segment = renderInfo.GetBaseline();
        var chunk = new TextChunk(
            renderInfo.GetText(),
            segment.GetStartPoint(),
            segment.GetEndPoint(),
            renderInfo.GetSingleSpaceWidth(),
            renderInfo.GetAscentLine(),
            renderInfo.GetDescentLine()
        );
        // glyph infos
        var glyph = chunk.Text;
        var left = chunk.StartLocation[0];
        var top = chunk.StartLocation[1];
        var right = chunk.EndLocation[0];
        var bottom = chunk.EndLocation[1];
    }
}

var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();

Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".

Currently, I can somewhat work around this using regex, but that is not always possible and is also far too technical for end users to configure.

Upvotes: 0

Views: 460

Answers (1)

Mario Z

Reputation: 4381

Try the latest NuGet package; we added a PdfTextContent.GetGlyphOffsets method:

Install-Package GemBox.Pdf -Version 17.0.1128-hotfix

Here is how you can use it:

using System.Linq;
using GemBox.Pdf;
using GemBox.Pdf.Content;

using (var document = PdfDocument.Load("input.pdf"))
{
    var page = document.Pages[0];
    var enumerator = page.Content.Elements.All(page.Transform).GetEnumerator();

    while (enumerator.MoveNext())
    {
        if (enumerator.Current.ElementType != PdfContentElementType.Text)
            continue;

        var textElement = (PdfTextContent)enumerator.Current;
        var text = textElement.ToString();

        int index = text.IndexOf("Number:");
        if (index < 0)
            continue;

        index += "Number:".Length;
        for (int i = index; i < text.Length; i++)
        {
            if (text[i] == ' ')
                index++;
            else
                break;
        }

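        // Map the element's bounds through the enumerator's current transform.
        var bounds = textElement.Bounds;
        enumerator.Transform.Transform(ref bounds);

        // Everything after "Number:" and the skipped spaces is the value.
        string orderNumber = text.Substring(index);
        // Intended as the x position where the value starts:
        // the element's left edge plus the glyph offset just before 'index'.
        double position = bounds.Left + textElement.GetGlyphOffsets().Skip(index - 1).First();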

        // TODO ...
    }
}
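
Once you have a position for each glyph, grouping them into "words" as described in the question could look roughly like the sketch below. It does not call GemBox at all: the tuple layout, WordGrouper, and maxGap are hypothetical names used only for illustration, and it assumes the glyphs of one line arrive in reading order with their left/right x-coordinates already computed.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of GemBox.Pdf or iTextSharp.
public static class WordGrouper
{
    // Splits the glyphs of one line (assumed to be in reading order) into words.
    // A word ends at a whitespace glyph, or when the horizontal gap between
    // the previous glyph's right edge and the next glyph's left edge exceeds maxGap.
    public static List<(string Text, double Left, double Right)> Group(
        IEnumerable<(char Value, double Left, double Right)> glyphs,
        double maxGap)
    {
        var words = new List<(string Text, double Left, double Right)>();
        var text = new StringBuilder();
        double left = 0, right = 0;

        // Emits the word collected so far, if any, and resets the buffer.
        void Flush()
        {
            if (text.Length > 0)
                words.Add((text.ToString(), left, right));
            text.Clear();
        }

        foreach (var g in glyphs)
        {
            if (char.IsWhiteSpace(g.Value))
            {
                Flush();
                continue;
            }

            // A large jump from the previous glyph starts a new word.
            if (text.Length > 0 && g.Left - right > maxGap)
                Flush();

            if (text.Length == 0)
                left = g.Left;

            text.Append(g.Value);
            right = g.Right;
        }

        Flush();
        return words;
    }
}
```

The maxGap threshold is in the same units as the glyph coordinates; a fraction of the font size is a plausible starting point, but that is a tuning assumption, not something GemBox prescribes.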

Upvotes: 1
