AlliterativeAlice

Reputation: 12577

Get x/y and width/height of all characters in a PDF using Ghostscript

I need to get the x/y position, width/height, and page number of each individual character in a PDF, ideally as percentages of the page dimensions.

Clearly, Ghostscript is able to do this, as it wouldn't be possible to convert PDFs to raster images otherwise. Is there a simple way to get Ghostscript to give me this information, or am I going to need to modify the source to hook into this functionality?

Upvotes: 0

Views: 312

Answers (1)

KenS

Reputation: 31199

Glyphs are rendered to bitmaps (using FreeType) and stored in the glyph cache, tagged with the font and matrix so that they can be uniquely identified. When text is rendered to the page, the cache is consulted first; if a hit exists, that bitmap is drawn at the current point. If not, the glyph is rendered and cached.

However, glyphs at extremely large point sizes are not cached; they are rendered each time, to avoid filling up or overflowing the cache.
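To make the caching behaviour concrete, here is a minimal Python sketch of the idea. It is a toy model, not Ghostscript's actual code: the `GlyphCache` class, the `render` callback and the `MAX_CACHED_PIXELS` cut-off are all made up for illustration, and the real cache is written in C and keyed on more than this.

```python
from typing import Callable, Dict, Tuple

Matrix = Tuple[float, ...]
Bitmap = Tuple[Tuple[int, ...], ...]
Key = Tuple[str, Matrix, int]  # (font name, matrix, glyph id)

# Hypothetical cut-off; the real limit used by Ghostscript is not given here.
MAX_CACHED_PIXELS = 64 * 64


class GlyphCache:
    """Toy model: look up by (font, matrix, glyph), render on a miss."""

    def __init__(self, render: Callable[[str, Matrix, int], Bitmap]) -> None:
        self._render = render          # stand-in for the real rasteriser
        self._store: Dict[Key, Bitmap] = {}
        self.hits = 0
        self.misses = 0

    def get(self, font: str, matrix: Matrix, glyph: int) -> Bitmap:
        key = (font, tuple(matrix), glyph)
        cached = self._store.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        bitmap = self._render(font, tuple(matrix), glyph)
        height = len(bitmap)
        width = len(bitmap[0]) if bitmap else 0
        # Very large glyphs are returned but never stored,
        # so they get rendered again on the next request.
        if width * height <= MAX_CACHED_PIXELS:
            self._store[key] = bitmap
        return bitmap
```

The point to take away is that the same glyph at a different matrix is a different cache entry, so the bitmap you capture depends on the scale and resolution in effect when the text was drawn.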

So in order to retrieve this information using Ghostscript, you would need to write a device which has a set of text methods. You would need to capture the glyph bitmaps in order to determine the width and height of each glyph, and the current point would give you the position on the page. The output_page method would tell you that a page had completed, so you would need to track the page number yourself.
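The bookkeeping such a device would have to do can be sketched in Python. This is only a model of the arithmetic, not a Ghostscript device (which would be written in C): `GlyphRecorder`, `on_glyph` and `on_output_page` are hypothetical names, the bitmap is a plain list of rows, and I am assuming the caller has already turned the current point into the top-left pixel of the bitmap, with y increasing downwards.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class GlyphBox:
    page: int
    x_pct: float
    y_pct: float
    width_pct: float
    height_pct: float


def ink_bounds(bitmap: Sequence[Sequence[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, width, height) of the non-zero pixels, or None if blank."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    if not rows:
        return None
    cols = [c for row in bitmap for c, px in enumerate(row) if px]
    left, right = min(cols), max(cols)
    top, bottom = rows[0], rows[-1]
    return left, top, right - left + 1, bottom - top + 1


class GlyphRecorder:
    """Hypothetical stand-in for the device: records one box per drawn glyph."""

    def __init__(self, page_width: int, page_height: int) -> None:
        self.page_width = page_width    # page size in device pixels
        self.page_height = page_height
        self.page = 1                   # tracked by hand, as described above
        self.boxes: List[GlyphBox] = []

    def on_glyph(self, bitmap: Sequence[Sequence[int]], bitmap_x: int, bitmap_y: int) -> None:
        bounds = ink_bounds(bitmap)
        if bounds is None:
            return  # nothing inked, e.g. a space
        left, top, width, height = bounds
        self.boxes.append(GlyphBox(
            page=self.page,
            x_pct=100 * (bitmap_x + left) / self.page_width,
            y_pct=100 * (bitmap_y + top) / self.page_height,
            width_pct=100 * width / self.page_width,
            height_pct=100 * height / self.page_height,
        ))

    def on_output_page(self) -> None:
        # Called when a page has completed; the next glyph belongs to the next page.
        self.page += 1
```

Working in percentages of the page size has the advantage that the result does not depend on the resolution the glyphs were rendered at.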

You could look at the txtwrite device to see how text is processed, and the epswrite device to see how to retrieve bitmaps. You'll need some combination of both.
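Before reading the txtwrite source, it can help to run the device and look at what it emits. The sketch below builds a command line for that from Python. As far as I recall from the Ghostscript documentation, -dTextFormat=0 makes txtwrite write XML-like output with position information rather than plain text, but treat that, and the exact output layout, as something to verify against your version. The helper names and the default executable name `gs` are assumptions (on Windows it is typically gswin64c).

```python
import shutil
import subprocess
from typing import List


def txtwrite_command(pdf_path: str, out_path: str, gs: str = "gs") -> List[str]:
    """Build a Ghostscript command line that runs the txtwrite device."""
    return [
        gs,
        "-dNOPAUSE",
        "-dBATCH",
        "-sDEVICE=txtwrite",
        "-dTextFormat=0",  # assumed: the format that includes positions
        "-sOutputFile=" + out_path,
        pdf_path,
    ]


def run_txtwrite(pdf_path: str, out_path: str, gs: str = "gs") -> None:
    """Run the command, failing early if the executable cannot be found."""
    if shutil.which(gs) is None:
        raise FileNotFoundError("Ghostscript executable not found: " + gs)
    subprocess.run(txtwrite_command(pdf_path, out_path, gs), check=True)
```

Whatever txtwrite reports will not necessarily match the inked extent of a rendered glyph, which is why the bitmaps are still needed for a true width and height.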

Be aware that 'text' in a PDF file need not be text. What appears to be text can be bitmaps, or vectors. Text can be encoded in unusual ways, and there may be no way to retrieve Unicode or other identifiable information about the glyphs (again the txtwrite device shows how you might extract such information if possible).

Also, fonts are not always embedded in PDF files, in which case a substitute font is used, which would mess up your width/height information.

This is quite a big project.

Upvotes: 1
