dock

Reputation: 11

How can I get the maximum font size in a PDF using PDFBox?

  1. I use PDFBox to extract some information from a PDF, but how can I extract the information of every object? If one of them contains a stream, how can I decode the stream to display it?

  2. Can I get the maximum font size from a PDF with PDFBox? I think that if I can get the font size of every object and sort them, I can find the object with the maximum font size.

Upvotes: 1

Views: 1565

Answers (1)

mkl

Reputation: 95918

I use PDFBox to extract some information from a PDF. But how can I extract the information of every object? If one of them contains a stream, how can I decode the stream to display it?

If by every object you mean everything drawn as part of the page content, these objects are contained in the page content streams and in the XObject streams they reference. You can parse these streams using a parser derived from the PDFStreamEngine class.

That class already does most of the heavy lifting, like retrieving individual operations from the streams, managing a stack of graphics states, etc. You will have to supply some callbacks, though, for the operations drawing the objects you are interested in.

To get an idea how to extend that class properly, have a look at some subclasses provided with PDFBox, e.g. PDFTextStripper, PDFMarkedContentExtractor, or PageDrawer.
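To make the division of labour concrete, here is a toy sketch in plain Java. It is not PDFBox code and not how PDFStreamEngine is implemented; the names ToyStreamEngine, State and TextCallback are made up for illustration, and the tokenizer only handles whitespace-separated tokens. It shows the general pattern described above: the engine collects operands, dispatches on operators, keeps a stack of graphics states for q/Q, and calls back into your code when something is drawn.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ToyStreamEngine {
    /** Hypothetical callback, standing in for the methods a real subclass would override. */
    interface TextCallback {
        void showText(String text, String fontName, float fontSize);
    }

    /** Minimal graphics state: only the current font and its nominal size. */
    static final class State {
        String fontName = "";
        float fontSize = 0f;

        State copy() {
            State s = new State();
            s.fontName = fontName;
            s.fontSize = fontSize;
            return s;
        }
    }

    /** Walks a simplified content stream; assumes well-formed input without spaces inside strings. */
    static void process(String content, TextCallback callback) {
        Deque<State> stack = new ArrayDeque<>();
        State current = new State();
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            switch (token) {
                case "q": // save a copy of the current graphics state
                    stack.push(current.copy());
                    break;
                case "Q": // restore the most recently saved state
                    if (!stack.isEmpty()) current = stack.pop();
                    break;
                case "Tf": // operands: font name, font size
                    current.fontName = operands.get(0);
                    current.fontSize = Float.parseFloat(operands.get(1));
                    break;
                case "Tj": // operand: the string to show
                    callback.showText(operands.get(0), current.fontName, current.fontSize);
                    break;
                default: // not a known operator, so treat it as an operand
                    operands.add(token);
                    continue;
            }
            operands.clear(); // an operator consumes its operands
        }
    }

    public static void main(String[] args) {
        process("/F1 12 Tf (Hello) Tj q /F2 24 Tf (Big) Tj Q (again) Tj",
                (text, font, size) -> System.out.println(text + " " + font + " " + size));
        // prints:
        // (Hello) /F1 12.0
        // (Big) /F2 24.0
        // (again) /F1 12.0
    }
}
```

Note that the size given to Tf is only the nominal font size; a real engine also has to account for the text matrix and the current transformation matrix, which is one reason to build on the PDFBox classes instead of rolling your own.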

Can I get the maximum font size from a PDF with PDFBox? I think that if I can get the font size of every object and sort them, I can find the object with the maximum font size.

Indeed, you can use the above-mentioned PDFTextStripper or, more exactly, a class derived from it. As is, the text stripper mainly returns plain text, but you can override certain of its methods to get text with additional information.

E.g. you can override writeString(String text, List<TextPosition> textPositions). Its standard implementation only uses the text, i.e. the extracted plain text, but you can inspect the textPositions, i.e. text with extra information, among them position and size.

This answer shows how to override PDFTextStripper.writeString to get access to the font name. Similarly, you can access the font size. Beware, there are two TextPosition methods for this, getFontSize and getFontSizeInPt, and you might actually need yet another kind of size.
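As a rough sketch of the bookkeeping involved, the following plain Java example tracks the largest size seen across writeString calls. To keep it runnable without PDFBox on the classpath, SizedText is a hypothetical stand-in for TextPosition that models only getFontSizeInPt, and MaxSizeCollector stands in for your PDFTextStripper subclass; in real code the loop body would go into the overridden writeString. There is no need to sort anything: a running maximum is enough.

```java
import java.util.Arrays;
import java.util.List;

public class MaxFontSizeSketch {
    /** Hypothetical stand-in for PDFBox's TextPosition; only the size accessor is modelled. */
    interface SizedText {
        float getFontSizeInPt();
    }

    /** Stand-in for a PDFTextStripper subclass that remembers the largest size it has seen. */
    static class MaxSizeCollector {
        private float max = 0f;
        private boolean seenAny = false;

        // Same shape as writeString(String text, List<TextPosition> textPositions)
        void writeString(String text, List<? extends SizedText> textPositions) {
            for (SizedText position : textPositions) {
                float size = position.getFontSizeInPt();
                if (!seenAny || size > max) {
                    max = size;
                    seenAny = true;
                }
            }
        }

        /** True once at least one text position has been inspected. */
        boolean hasText() {
            return seenAny;
        }

        /** Largest size seen so far; 0 if no text was seen. */
        float getMax() {
            return max;
        }
    }

    /** Helper for the demo: a SizedText that always reports the given size. */
    static SizedText sized(float size) {
        return () -> size;
    }

    public static void main(String[] args) {
        MaxSizeCollector collector = new MaxSizeCollector();
        collector.writeString("Title", Arrays.asList(sized(18f), sized(18f)));
        collector.writeString("body", Arrays.asList(sized(10f), sized(12f)));
        System.out.println(collector.getMax()); // prints 18.0
    }
}
```

If you also want to know which text has the maximum size, store the text argument alongside the size whenever the maximum is updated.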

EDIT

In a comment, the OP asked

How can I get started with PDFStreamEngine?

As mentioned above, have a look at some subclasses provided with PDFBox. The most prominent surely is the PDFTextStripper.

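The simplest PDFTextStripper use looks like this:

PDFTextStripper stripper = new PDFTextStripper();
// Sort text by its position on the page rather than by content stream order
stripper.setSortByPosition(true);

// PDF_DOCUMENT is the PDF source to read, e.g. a java.io.File
PDDocument document = PDDocument.load(PDF_DOCUMENT);
String text;
try {
    text = stripper.getText(document);
} finally {
    document.close(); // always release the document, even if extraction fails
}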

This only extracts the plain text of the document. For more specialized tasks look at these sample usages:

More usage examples of PDFStreamEngine and other sub-classes:

How can I obtain the TextPosition from a PDF?

As mentioned in my original answer, use a PDFTextStripper and override writeString(String text, List<TextPosition> textPositions). Its standard implementation only uses the text, i.e. the extracted plain text, but you can inspect the textPositions, i.e. text with extra information, among them position and size.

Upvotes: 3
