Reputation: 17144
Goal: extract a value from a specific location inside a PDF page. In GemBox.Pdf, I can extract text elements including their bounds and content, but:
Problem: a text element can have a complex structure, with each glyph being positioned using individual settings.
Consider this common example of a page header:
Billing Info       Date:          02/02/2022
Company Ltd.       Order Number:  0123456789
123 Main Street    Name:          Smith, John
Let's say I want to get the order number (0123456789) from the document, knowing its precise position on the page. In practice, the entire line is often one single text element with the content "SO CompanyOrder Number:0123456789", and all positioning and spacing is done via offsets and indices only. I can get the bounds and text of the entire line, but I need the bounds (and value) of each character/glyph, so that I can combine them into "words" (= character sequences separated by whitespace or large offsets).
I know this is definitely possible in other libraries, but this question is specific to GemBox. It seems to me that all the necessary implementations should already be there, just not much of it is exposed in the API.
In itextsharp, I can get the bounds for each single glyph like this:
// itextsharp 5.2.1.0
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class GlyphExtractionStrategy : LocationTextExtractionStrategy
{
    // Called by the parser for every rendered text chunk.
    public override void RenderText(TextRenderInfo renderInfo)
    {
        var segment = renderInfo.GetBaseline();
        var chunk = new TextChunk(
            renderInfo.GetText(),
            segment.GetStartPoint(),
            segment.GetEndPoint(),
            renderInfo.GetSingleSpaceWidth(),
            renderInfo.GetAscentLine(),
            renderInfo.GetDescentLine()
        );
        // glyph infos
        var glyph = chunk.Text;
        var left = chunk.StartLocation[0];
        var top = chunk.StartLocation[1];
        var right = chunk.EndLocation[0];
        var bottom = chunk.EndLocation[1];
    }
}

var reader = new PdfReader(bytes);
var strategy = new GlyphExtractionStrategy();
PdfTextExtractor.GetTextFromPage(reader, pageNumber: 1, strategy);
reader.Close();
Is this possible in GemBox? If so, that would be helpful, because we already have the code to combine the glyphs into "words".
Currently, I can somewhat work around this using regex, but this is not always possible and also way too technical for end users to configure.
Upvotes: 0
Views: 460
Reputation: 4381
Try the latest NuGet package; we added the PdfTextContent.GetGlyphOffsets method:
Install-Package GemBox.Pdf -Version 17.0.1128-hotfix
Here is how you can use it:
using System.Linq;
using GemBox.Pdf;
using GemBox.Pdf.Content;

using (var document = PdfDocument.Load("input.pdf"))
{
    var page = document.Pages[0];
    var enumerator = page.Content.Elements.All(page.Transform).GetEnumerator();
    while (enumerator.MoveNext())
    {
        // Only text elements are of interest.
        if (enumerator.Current.ElementType != PdfContentElementType.Text)
            continue;

        var textElement = (PdfTextContent)enumerator.Current;
        var text = textElement.ToString();

        // Find the label that precedes the order number.
        int index = text.IndexOf("Number:");
        if (index < 0)
            continue;
        index += "Number:".Length;

        // Skip any spaces between the label and the value.
        for (int i = index; i < text.Length; i++)
        {
            if (text[i] == ' ')
                index++;
            else
                break;
        }

        // Map the element bounds into page coordinates.
        var bounds = textElement.Bounds;
        enumerator.Transform.Transform(ref bounds);

        string orderNumber = text.Substring(index);
        // Horizontal position where the order number starts.
        double position = bounds.Left + textElement.GetGlyphOffsets().Skip(index - 1).First();
        // TODO ...
    }
}
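If you need every "word" in the element rather than just the order number, the same offsets can feed a grouping step. Below is a minimal sketch of that step, kept independent of GemBox on purpose: `Word`, `GlyphGrouping.SplitIntoWords` and `maxGap` are hypothetical names, not part of the GemBox.Pdf API. It assumes you have already turned the offsets into one absolute left position per character (for example `bounds.Left` plus the matching offset), which you should verify against your own documents. Since only start positions are known, the gap is measured between the starts of consecutive glyphs, so `maxGap` has to be larger than the widest glyph advance.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical result type: a word and the left position of its first glyph.
public sealed class Word
{
    public string Text { get; }
    public double Left { get; }
    public Word(string text, double left) { Text = text; Left = left; }
}

public static class GlyphGrouping
{
    // Hypothetical helper, not part of GemBox.Pdf.
    // Assumption: glyphLefts[i] is the absolute left position of text[i].
    // A new word starts at whitespace or when the distance between the starts
    // of two consecutive glyphs exceeds maxGap.
    public static List<Word> SplitIntoWords(
        string text, IReadOnlyList<double> glyphLefts, double maxGap)
    {
        if (text.Length != glyphLefts.Count)
            throw new ArgumentException("Expected one position per character.");

        var words = new List<Word>();
        var current = new StringBuilder();
        double wordLeft = 0;
        double previousLeft = 0;

        for (int i = 0; i < text.Length; i++)
        {
            bool isSpace = char.IsWhiteSpace(text[i]);
            bool bigGap = current.Length > 0 && glyphLefts[i] - previousLeft > maxGap;

            // Close the current word at whitespace or at a large gap.
            if ((isSpace || bigGap) && current.Length > 0)
            {
                words.Add(new Word(current.ToString(), wordLeft));
                current.Clear();
            }

            if (!isSpace)
            {
                if (current.Length == 0)
                    wordLeft = glyphLefts[i];
                current.Append(text[i]);
                previousLeft = glyphLefts[i];
            }
        }

        // Flush the last word, if any.
        if (current.Length > 0)
            words.Add(new Word(current.ToString(), wordLeft));

        return words;
    }
}
```

With that in place, you would look up the word whose `Left` is closest to the known position of the order number instead of searching for the "Number:" label.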
Upvotes: 1