I use PDFBox to extract some information from a PDF, but how can I extract the information of every object? If one of them contains a stream, how can I decode the stream to display it? Can I get the maximum font size from a PDF using PDFBox? I think that if I can get the font size of every object and sort them, I get the object which has the maximum font size?

object font-size pdfbox pdf-extraction

dock

Reputation: 11

How can I get max fontsize of a pdf using pdfbox


Upvotes: 1

Answers (1)

mkl

Reputation: 95918

I use PDFBox to extract some information from a PDF. But how can I extract every object's information? If one of them contains a stream, how can I decode the stream to display it?

If by every object you mean everything drawn as part of the page content, these objects are contained in the page content streams and in the XObject streams they reference. You can parse these streams using a parser derived from the PDFStreamEngine class.

That class already does most of the heavy lifting, like retrieving the individual operations from the streams and managing a stack of graphics states. You will have to supply some callbacks, though, for the operations that draw the objects you are interested in.

To get an idea how to extend that class properly, have a look at some subclasses provided with PDFBox, e.g. PDFTextStripper, PDFMarkedContentExtractor, or PageDrawer.
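As a rough starting point, here is a minimal sketch of such a subclass. It assumes PDFBox 2.x (package names and PDDocument.load differ in other versions), and the class name OperatorCounter is made up for illustration. It merely counts the operators the engine reads from each page; a real extractor would react to specific operators instead.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;

/** Hypothetical example class: counts how often each operator occurs in the page content. */
public class OperatorCounter extends PDFStreamEngine {
    private final Map<String, Integer> counts = new TreeMap<>();

    // Callback invoked by the engine for every operator it reads from a content stream.
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException {
        counts.merge(operator.getName(), 1, Integer::sum);
        // Let the base class dispatch to any operator handlers that have been registered.
        super.processOperator(operator, operands);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the PDF to inspect.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            OperatorCounter counter = new OperatorCounter();
            for (PDPage page : document.getPages()) {
                counter.processPage(page);
            }
            counter.counts.forEach((name, n) -> System.out.println(name + ": " + n));
        }
    }
}
```

Since this sketch registers no operator handlers, it only sees the page content streams themselves. To follow referenced XObject streams or track the graphics state, register the matching handlers with addOperator, as the PDFBox subclasses mentioned above do.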

Can I get the maximum font size from a PDF using PDFBox? I think that if I can get every object's font size and sort them, I get the object which has the maximum font size?

Indeed, you can use the above-mentioned PDFTextStripper or, more exactly, a class derived from it. The text stripper as is mainly returns plain text, but you can override certain of its methods to get the text with additional information.

E.g. you can override writeString(String text, List<TextPosition> textPositions). Its standard implementation only uses the text, i.e. the extracted plain text, but you can inspect the textPositions, i.e. the text with extra information, among them position and size.

This answer shows how to override PDFTextStripper.writeString to get access to the font name. Similarly you can access the font size. Beware, there are two TextPosition methods for this, getFontSize and getFontSizeInPt, and you might actually need yet another kind of size.
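A minimal sketch of that idea, assuming PDFBox 2.x; the class name MaxFontSizeStripper and its fields are made up for illustration. Instead of collecting all sizes and sorting them, it keeps a running maximum while the stripper walks the text. Which size accessor to use is an assumption here, see the caveat above.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

/** Hypothetical example class: remembers the largest font size seen while stripping text. */
public class MaxFontSizeStripper extends PDFTextStripper {
    private float maxFontSize = 0f;
    private String textAtMax = "";

    public MaxFontSizeStripper() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        for (TextPosition position : textPositions) {
            // Assumption: getFontSizeInPt() is the size of interest; swap in
            // getFontSize() or another measure if that fits your documents better.
            float size = position.getFontSizeInPt();
            if (size > maxFontSize) {
                maxFontSize = size;
                textAtMax = text; // the chunk of text containing the largest glyph so far
            }
        }
        super.writeString(text, textPositions); // keep the normal plain-text output
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the PDF to inspect.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            MaxFontSizeStripper stripper = new MaxFontSizeStripper();
            stripper.getText(document); // triggers the writeString callbacks
            System.out.println("Max font size: " + stripper.maxFontSize
                    + " in \"" + stripper.textAtMax + "\"");
        }
    }
}
```

If you need all occurrences of the largest size rather than the first one, collect the matching chunks in a list instead of a single string.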

EDIT

In a comment, the OP asked

How can I get started with PDFStreamEngine?

As mentioned above, have a look at some subclasses provided with PDFBox. The most prominent surely is the PDFTextStripper.

The simplest use of PDFTextStripper looks like this:

PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(true);

PDDocument document = PDDocument.load(PDF_DOCUMENT);
String text = stripper.getText(document);
document.close();

This only extracts the plain text of the document. For more specialized tasks look at these sample usages:

ExtractTextByArea.java - PDFBox example on how to extract text from a specific area on the PDF document
PrintTextLocations.java - PDFBox example on how to get some x/y coordinates of text
Get font of each line using PDFBox - Stack Overflow answer illustrating how to extract text with font information
Identifying the text based on the output in PDF using PDFBOX - Stack Overflow answer illustrating how to extract text with color information
How to determine artificial bold style ,artificial italic style and artificial outline style of a text using PDFBOX - Stack Overflow answer illustrating how to extract text identifying certain artificial styles
PDF file extraction using PDFBOX for tabular data - Stack Overflow answer illustrating how to extract text attempting to reflect the PDF file layout in the output
How to check if a text is transparent with pdfbox - Stack Overflow answer illustrating how to extract only text not covered by some image

More usage examples of PDFStreamEngine and other sub-classes:

PrintImageLocations.java - PDFBox example on how to get the x/y coordinates of image locations, based on PDFStreamEngine directly
Get Visible Signature from a PDF using PDFBox? - Stack Overflow answer illustrating how to draw annotations, especially signature visualizations, based on PageDrawer

How can I obtain the TextPosition from a PDF?

As mentioned in my original answer, use a PDFTextStripper and override writeString(String text, List<TextPosition> textPositions). Its standard implementation only uses the text, i.e. the extracted plain text, but you can inspect the textPositions, i.e. text with extra information, among them position and size.
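To make this concrete, here is a small sketch along the lines of the PrintTextLocations example, again assuming PDFBox 2.x; the class name TextPositionPrinter is made up for illustration. It prints each glyph with its position and both font size values, so you can compare them for your documents.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

/** Hypothetical example class: prints position and size information for every glyph. */
public class TextPositionPrinter extends PDFTextStripper {

    public TextPositionPrinter() throws IOException {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException {
        for (TextPosition p : textPositions) {
            // One line per glyph: its text, adjusted coordinates, and both size values.
            System.out.printf("'%s' x=%.1f y=%.1f fontSize=%.1f fontSizeInPt=%.1f%n",
                    p.getUnicode(), p.getXDirAdj(), p.getYDirAdj(),
                    p.getFontSize(), p.getFontSizeInPt());
        }
        // super.writeString is deliberately not called: only the per-glyph data is wanted.
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the PDF to inspect.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            new TextPositionPrinter().getText(document);
        }
    }
}
```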

Upvotes: 3
