Reputation: 35
I want to extract the whole images on each page of a PDF document using PDFBox in Java, but all the extracted images come out inverted and split into pieces. Note that this is not a bug in PDFBox or poppler; it is caused by the way the PDF document itself is built. How can I piece the whole images together and get every image in the right orientation? Any advice would be appreciated; a snippet of Java code is preferred. My PDF link: download
Upvotes: 1
Views: 490
Reputation: 95918
At first glance it looked like each of the figures in question was drawn in a separate block of content stream instructions enveloped by, but not containing, text objects. Thus, one approach to isolating them is to export each such block of instructions to a separate new page. You can then post-process these new pages, e.g. by rendering them as bitmap images using PDFBox's PDFRenderer.
I based the code for this on the PdfContentStreamEditor, originally from this answer, like this:
PDDocument document = PDDocument.load(...);
for (PDPage page : document.getDocumentCatalog().getPages()) {
    PdfContentStreamEditor editor = new PdfContentStreamEditor(document, page) {
        // Top-level instructions (outside any q ... Q block) seen so far;
        // they are replayed at the start of every new figure page.
        ByteArrayOutputStream commonRaw = null;
        ContentStreamWriter commonWriter = null;
        int depth = 0;

        @Override
        public void processPage(PDPage page) throws IOException {
            commonRaw = new ByteArrayOutputStream();
            try {
                commonWriter = new ContentStreamWriter(commonRaw);
                startFigurePage(page);
                super.processPage(page);
            } finally {
                endFigurePage();
                commonRaw.close();
            }
        }

        @Override
        protected void write(ContentStreamWriter contentStreamWriter, Operator operator,
                List<COSBase> operands) throws IOException {
            String operatorString = operator.getName();

            // A text object starts: close the current figure page.
            if (operatorString.equals("BT")) {
                endFigurePage();
            }
            if (operatorString.equals("q")) {
                depth++;
            }
            writeFigure(operator, operands);
            if (operatorString.equals("Q")) {
                depth--;
            }
            // The text object ended: start collecting the next figure page.
            if (operatorString.equals("ET")) {
                startFigurePage(getCurrentPage());
            }

            // The original page content is passed through unchanged.
            super.write(contentStreamWriter, operator, operands);
        }

        OutputStream figureRaw = null;
        ContentStreamWriter figureWriter = null;
        PDPage figurePage = null;
        int xobjectsDrawn = 0;
        int pathsPainted = 0;

        void startFigurePage(PDPage currentPage) throws IOException {
            figurePage = new PDPage(currentPage.getMediaBox());
            figurePage.setResources(currentPage.getResources());
            PDStream stream = new PDStream(document);
            figurePage.setContents(stream);
            figureWriter = new ContentStreamWriter(figureRaw = stream.createOutputStream(COSName.FLATE_DECODE));
            figureRaw.write(commonRaw.toByteArray());
            xobjectsDrawn = 0;
            pathsPainted = 0;
        }

        void endFigurePage() throws IOException {
            if (figureWriter != null) {
                figureWriter = null;
                figureRaw.close();
                figureRaw = null;
                // Keep the page only if it drew an XObject or painted more than three paths.
                if (xobjectsDrawn > 0 || pathsPainted > 3)
                    document.addPage(figurePage);
                figurePage = null;
            }
        }

        final List<String> PATH_PAINTING_OPERATORS = Arrays.asList("S", "s", "F", "f", "f*",
                "B", "B*", "b", "b*");

        void writeFigure(Operator operator, List<COSBase> operands) throws IOException {
            if (figureWriter != null) {
                String operatorString = operator.getName();
                boolean isXObjectDo = operatorString.equals("Do");
                boolean isPathPainting = PATH_PAINTING_OPERATORS.contains(operatorString);
                if (isXObjectDo)
                    xobjectsDrawn++;
                if (isPathPainting)
                    pathsPainted++;
                figureWriter.writeTokens(operands);
                figureWriter.writeToken(operator);
                if (depth == 0) {
                    // Record top-level instructions for later figure pages, but skip XObjects
                    // and end paths with "n" so nothing is actually painted again.
                    if (!isXObjectDo) {
                        if (isPathPainting)
                            operator = Operator.getOperator("n");
                        commonWriter.writeTokens(operands);
                        commonWriter.writeToken(operator);
                    }
                }
            }
        }
    };
    editor.processPage(page);
}
document.save(new File(RESULT_FOLDER, "my-isolatedFigures.pdf"));
(IsolateFigures test testIsolateInMy)
The first figures are extracted quite fine:
Certain figures, though, turn out to contain text objects and are therefore split into partial images that lose their text content:
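To make the splitting rule easier to follow without PDFBox, here is a simplified sketch that models a content stream as a plain list of operator names. The class FigureBlockSketch and its methods are hypothetical and only mirror the heuristic used above (cut at BT ... ET, keep a block if it draws an XObject or paints more than three paths); it does not read or write PDF data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, PDFBox-free model of the heuristic: operators are plain strings.
final class FigureBlockSketch {
    static final List<String> PATH_PAINTING_OPERATORS =
            Arrays.asList("S", "s", "F", "f", "f*", "B", "B*", "b", "b*");

    /** Splits an operator sequence at text objects and keeps blocks that look like figures. */
    static List<List<String>> figureBlocks(List<String> operators) {
        List<List<String>> kept = new ArrayList<>();
        List<String> block = new ArrayList<>();
        boolean inText = false;
        for (String op : operators) {
            if (op.equals("BT")) {           // a text object starts: close the current block
                keepIfFigure(block, kept);
                block = new ArrayList<>();
                inText = true;
            } else if (op.equals("ET")) {    // the text object ended: collect the next block
                inText = false;
            } else if (!inText) {            // text content itself is never part of a block
                block.add(op);
            }
        }
        keepIfFigure(block, kept);           // block after the last text object
        return kept;
    }

    /** Same threshold as above: at least one XObject or more than three painted paths. */
    static void keepIfFigure(List<String> block, List<List<String>> kept) {
        int xobjectsDrawn = 0;
        int pathsPainted = 0;
        for (String op : block) {
            if (op.equals("Do")) xobjectsDrawn++;
            if (PATH_PAINTING_OPERATORS.contains(op)) pathsPainted++;
        }
        if (xobjectsDrawn > 0 || pathsPainted > 3) kept.add(block);
    }

    public static void main(String[] args) {
        // A figure interrupted by a text object ends up as two separate blocks.
        List<String> ops = Arrays.asList("q", "Do", "Q", "BT", "Tj", "ET", "q", "Do", "Q");
        System.out.println(figureBlocks(ops));
    }
}
```

The example in main shows the limitation mentioned above: as soon as a figure contains a text object, the rule cuts it in two and drops the text.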
Upvotes: 0
Reputation: 11730
Here are the first 6 images. We can see they are simply the text on the right, whereas the artwork is specified as single vector line paths (as shown on the left).
Extracting hundreds or thousands of such images is more work than it's worth.
Page 1 alone has 115, at an unusually high density of 1200 ppi:
C:\Apps\PDF\poppler\poppler-23.05.0\Library\bin>pdfimages -list -f 1 -l 1 my.pdf
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 stencil   144   468  -      1   1  ccitt no      348  0  1200  1200 197B 2.3%
   1     1 stencil    64   456  -      1   1  image no      349  0  1200  1200 165B 4.5%
   1     2 stencil    64   456  -      1   1  image no      349  0  1200  1200 165B 4.5%
   1     3 stencil    72   468  -      1   1  ccitt no      350  0  1200  1200 154B 3.7%
   1     4 stencil   192   468  -      1   1  ccitt no      351  0  1200  1200 264B 2.4%
   1     5 stencil    96   456  -      1   1  ccitt no      352  0  1200  1200 142B 2.6%
   1     6 stencil   136   570  -      1   1  ccitt no      353  0  1200  1200 192B 2.0%
   1     7 stencil   224   582  -      1   1  ccitt no      419  0  1200  1200 329B 2.0%
   1     8 stencil   104   582  -      1   1  ccitt no      420  0  1200  1200 194B 2.6%
   1     9 stencil   192   582  -      1   1  ccitt no      345  0  1200  1200 306B 2.2%
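To put the listing in perspective, the pixel sizes can be converted to page units. This is a rough sketch, assuming the reported x-ppi/y-ppi values are accurate and using the PDF default of 72 points per inch; the class and method names are made up for illustration.

```java
// Hypothetical helper: converts a pdfimages pixel size to its size on the page.
final class StencilSizeSketch {
    /** Size in PDF points of an image dimension given in pixels at the given ppi. */
    static double pixelsToPoints(int pixels, double ppi) {
        return pixels / ppi * 72.0;   // 72 points per inch in default user space
    }

    public static void main(String[] args) {
        // First row of the listing: 144 x 468 pixels at 1200 ppi.
        System.out.printf("%.2f x %.2f pt%n",
                pixelsToPoints(144, 1200), pixelsToPoints(468, 1200));
    }
}
```

For the first row this works out to roughly 8.6 x 28 pt, i.e. a fragment of about 3 mm by 10 mm, which fits the observation that the stencils are small pieces rather than whole figures.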
So export each marquee area as an image instead.
It is possible to define the areas programmatically as vectors, but as fast as you can see them (about 4 x/y rectangle values each) you could copy each one to the clipboard and automate saving as image6.png, 7.png, 8.png, etc.
Some people attempt to specify how white space may define a capturable area, but it depends on whether you have the time to write a custom detector based on a search for "6. blah" or "7. blah" (not 1. - 5.), then taking a full-width region of a given height under that. Here, using Poppler:
pdftoppm -f 1 -l 1 -r 300 -x 360 -W 1750 -y 375 -H 360 -png my.pdf out6
Now that we have the measure of it, we can apply the Y distance between 6. and 7. as the offset for the following areas.
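Since a Java snippet was asked for, the repeated crops could be generated along these lines. This is only a sketch: PdftoppmCropSketch and its methods are hypothetical, the pdftoppm flags are the ones from the command above, and the yStep value in main is a placeholder that has to be measured on the actual page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that builds pdftoppm command lines for evenly spaced crop areas.
final class PdftoppmCropSketch {
    /** One pdftoppm call cropping a width x height pixel area at (x, y) of a single page. */
    static List<String> cropCommand(int page, int dpi, int x, int y, int width, int height,
                                    String pdf, String outPrefix) {
        return Arrays.asList("pdftoppm",
                "-f", String.valueOf(page), "-l", String.valueOf(page),
                "-r", String.valueOf(dpi),
                "-x", String.valueOf(x), "-W", String.valueOf(width),
                "-y", String.valueOf(y), "-H", String.valueOf(height),
                "-png", pdf, outPrefix);
    }

    /** Commands for count areas, each shifted down by yStep pixels from the previous one. */
    static List<List<String>> cropCommands(int page, int dpi, int x, int firstY, int width,
                                           int height, int yStep, int count, int firstNumber,
                                           String pdf) {
        List<List<String>> commands = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            commands.add(cropCommand(page, dpi, x, firstY + i * yStep, width, height,
                    pdf, "out" + (firstNumber + i)));
        }
        return commands;
    }

    public static void main(String[] args) {
        int yStep = 400; // placeholder, NOT measured: distance between 6. and 7. in pixels
        for (List<String> command : cropCommands(1, 300, 360, 375, 1750, 360, yStep, 3, 6, "my.pdf")) {
            System.out.println(String.join(" ", command));
        }
    }
}
```

Each list can be handed to a ProcessBuilder to actually run Poppler; the sketch only prints the commands so the coordinates can be checked first.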
Upvotes: 0