Reputation: 325
I am trying to extract the headlines of some pdf files to sort them. Unfortunately there's a space between every letters with the spaces between words bigger than the ones between letters of the same word. Here's my extraction method:
PdfReader reader = new PdfReader(filename);
Rectangle rect = new Rectangle(0, 0, 1000, 1000);
RenderFilter regionFilter = new RegionTextRenderFilter(rect);
FontRenderFilter fontFilter = new FontRenderFilter();
FilteredTextRenderListener strategy = new FilteredTextRenderListener(
new LocationTextExtractionStrategy(), regionFilter, fontFilter);
string result = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
reader.Close();
Is there a way to filter out the smaller spaces?
Upvotes: 0
Views: 1257
Reputation: 165
iText uses the distance of the rendered glyphs as base to decide if a space is present or not. The general rule applied is, if the distance is larger than the width of a normal space, devided by 2, than a space character is recognized. While this works quite well in most cases, it doesn't work at all, if the width of a space character could not be determined for the font used. In my case the width of a space was recognized as 0, thus the smallest distance between glyphs was recognized as a space. I based my solution on another answer from mkl to a question that is very similar to yours.
In short: You need to derive from e.g. SimpleTextExtractionStrategy or LocationTextExtractionStrategy and override the methods that convert the distance between glyphs into spaces (renderText or isChunkAtWordBoundary respectively).
You can also refer to the answer I gave here or the original solution by mkl.
Upvotes: 2