Reputation: 11

How to extract pictures from a PDF file

I want to extract pictures from PDF files using C++, but I don't understand the picture format in PDF files. Can someone help me?

I looked at the content of a PDF file by opening it in Notepad. I tried to unzip the content but failed to extract the pictures.

Upvotes: 0

Answers (2)

K J

Reputation: 11857

There are dozens upon dozens of ways an image can be stored in a PDF. To show just one of them, here is the smallest working example I can easily write.

It has 9 basic colours for comparison: RGB, CMY and AWK (alpha, white, black).

If your editor preserves raw bytes as well as MS Notepad does, saving the text below as colours.pdf should work. However, pasted on the web it will likely be corrupted, so a download is here. Colours.pdf should work in most viewers; it just is not shown as a GitHub page (but see later).

%PDF-1.7
%µ¶
1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj
2 0 obj <</Type/Pages/Count 1/Kids[3 0 R]>> endobj
3 0 obj <</Type/Page/MediaBox[0 0 72 72]/Rotate 0/Resources 4 0 R/Contents 6 0 R/Parent 2 0 R>> endobj
4 0 obj <</XObject<</Img3 7 0 R>>>> endobj
5 0 obj <</Length 12/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 1/ColorSpace/DeviceGray/Filter/FlateDecode>>
stream
xœûÿ¿þ? ú}
endstream
endobj
6 0 obj <</Length 40/Filter/FlateDecode>>
stream
xœ3T0 B]C]s#…ä\.    Ó!}ÏÜtc—|. È    >
endstream
endobj
7 0 obj
<</Length 22/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB/Filter/FlateDecode>>
stream
xœûÏÀÀðŒÿÿ‡`L §sõ
endstream
endobj

xref
0 8
0000000000 00001 f
0000000016 00000 n
0000000062 00000 n
0000000114 00000 n
0000000218 00000 n
0000000262 00000 n
0000000427 00000 n
0000000535 00000 n

trailer
<</Size 8/Root 1 0 R>>
startxref
721
%%EOF

So the points to note are:

It is 3 pels (pixels) wide by 3 pels high.
Each pixel is 1/3 of an inch in both directions.
The source can be PBM, PNG, GIF, TIFF or any other bitmap format (even JPEG), but the PDF writer has to throw away the source file's headers and store raw pixels, so only 9 colour values are needed to store this image.
If the source is a baseline JPEG, it may be embedded whole, without stripping its headers.
If the image has an alpha channel (as here, from a PNG), the alpha data is stored as a separate object (here object 5, referenced by /SMask).
Each object can have its own compression (one or two of many filters) and may even be encrypted. Here the alpha was originally Filter/CCITTFaxDecode, which is complex to compare, so I changed it to match the RGB colours; both are now deflated with Filter/FlateDecode.
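As a first step towards extraction, the points above suggest reading each image object's dictionary to learn its size, bit depth and filter before touching the stream. The following is only a minimal sketch: `valueStart`, `dictInt` and `dictName` are hypothetical helper names, and the sketch assumes the dictionary is already available as plain text with direct values, as in the example file. A real PDF needs a proper parser (indirect references, filter arrays, object streams, encryption).

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: position of the first character of the value that
// follows "/key", or std::string::npos when the key is absent.
std::size_t valueStart(const std::string& dict, const std::string& key) {
    const std::string name = "/" + key;
    for (std::size_t pos = dict.find(name); pos != std::string::npos;
         pos = dict.find(name, pos + 1)) {
        std::size_t i = pos + name.size();
        if (i >= dict.size()) break;
        const unsigned char next = static_cast<unsigned char>(dict[i]);
        // Accept only a complete key: "/Width" must not match "/WidthX".
        if (!std::isspace(next) && next != '/' && next != '[' &&
            next != '<' && next != '(') {
            continue;
        }
        while (i < dict.size() &&
               std::isspace(static_cast<unsigned char>(dict[i]))) {
            ++i;
        }
        return i;
    }
    return std::string::npos;
}

// Integer value of a key such as /Width or /Length, or -1 if not found.
// For an indirect reference like "/SMask 5 0 R" this yields the object number.
int dictInt(const std::string& dict, const std::string& key) {
    const std::size_t start = valueStart(dict, key);
    if (start == std::string::npos) return -1;
    std::size_t end = start;
    while (end < dict.size() &&
           std::isdigit(static_cast<unsigned char>(dict[end]))) {
        ++end;
    }
    return end > start ? std::stoi(dict.substr(start, end - start)) : -1;
}

// Name value of a key such as /Filter or /ColorSpace (without the slash),
// or an empty string if the key is absent or its value is not a name.
std::string dictName(const std::string& dict, const std::string& key) {
    std::size_t start = valueStart(dict, key);
    if (start == std::string::npos || start >= dict.size() ||
        dict[start] != '/') {
        return "";
    }
    ++start;  // skip the leading slash of the name
    std::size_t end = start;
    const std::string delimiters = "/<>[]()";
    while (end < dict.size() &&
           !std::isspace(static_cast<unsigned char>(dict[end])) &&
           delimiters.find(dict[end]) == std::string::npos) {
        ++end;
    }
    return dict.substr(start, end - start);
}
```

With the dictionary of object 7 above, `dictInt(dict, "Width")` gives 3 and `dictName(dict, "Filter")` gives `FlateDecode`. The filter name then decides the next step: a FlateDecode stream has to be inflated (for example with zlib) to get raw samples, whereas a DCTDecode stream is already JPEG data.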

So in order to extract the two images as one, you need to write a library of functions covering every permutation you may encounter. However, it is far simpler to use a small application (10-50 MB, a single executable) that already handles most of those permutations, honed through much trial and error.

Here is the PDF with its streams decoded, to show exactly how those raw colours work.
https://github.com/GitHubRulesOK/MyNotes/blob/master/colours_decoded.pdf

BE AWARE: the colours below are no longer true (you can see that some bytes have been mangled into text); this is for simple illustration only.

%PDF-1.7
%µ¶
1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj
2 0 obj <</Type/Pages/Count 1/Kids [ 3 0 R ]>> endobj
3 0 obj <</Type/Page/MediaBox [ 0 0 72 72 ]/Rotate 0/Resources 4 0 R/Contents 6 0 R/Parent 2 0 R>> endobj
4 0 obj <</XObject<</Img3 7 0 R>>>> endobj
5 0 obj <</Length 4/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 1/ColorSpace/DeviceGray>>
stream
ÿÿÿ
endstream
endobj
6 0 obj <</Length 46>>
stream
1 0 0 -1 -0 72 cm
72 0 0 -72 0 72 cm
/Img3 Do
endstream
endobj

7 0 obj<</Length 27/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB>>
stream
ÿ   ÿ   ÿ ÿÿÿ ÿÿÿ    ÿÿÿ   
endstream
endobj

xref
0 8
0000000000 00002 f 
0000000015 00000 n 
0000000060 00000 n 
0000000114 00000 n 
0000000220 00000 n 
0000000263 00000 n 
0000000399 00000 n 
0000000493 00000 n 

trailer
<</Size 8/Root 1 0 R>>
startxref
664
%%EOF

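So the colour block is ÿ ÿ ÿ ÿÿÿ ÿÿÿ ÿÿÿ, where each ÿ is the byte FF and each of those spaces should be a null (00) byte, giving 3 bytes per pixel.

In hex: FF0000=R 00FF00=G 0000FF=B 00FFFF=C FF00FF=M FFFF00=Y 000000=A FFFFFF=W 000000=K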
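Once the samples are decoded like this, the simplest way to get them out of the PDF is to add a header back, since the PDF writer threw the original one away. A minimal sketch, assuming 8 bits per component and DeviceRGB as in object 7, wraps the raw bytes in a binary PPM (P6) header; `rgbToPpm` is a hypothetical name, and the alpha mask in object 5 is ignored here.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Wraps raw 8-bit RGB samples (3 bytes per pixel, rows top to bottom)
// in a binary PPM (P6) container and returns the file content.
std::string rgbToPpm(const std::vector<unsigned char>& samples,
                     int width, int height) {
    if (width <= 0 || height <= 0) {
        throw std::invalid_argument("width and height must be positive");
    }
    const std::size_t expected =
        static_cast<std::size_t>(width) * static_cast<std::size_t>(height) * 3;
    if (samples.size() != expected) {
        throw std::invalid_argument("sample count is not width * height * 3");
    }
    // PPM header: magic number, dimensions, maximum sample value.
    std::string out = "P6\n" + std::to_string(width) + " " +
                      std::to_string(height) + "\n255\n";
    out.append(samples.begin(), samples.end());
    return out;
}
```

Writing the returned string to a file opened with `std::ios::binary` gives a 3 x 3 picture that most image viewers can open. Other colour spaces and bit depths need their own conversion.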

Also note that in this odd case decompressed is way smaller than when compressed (except for the RGB image).

A third variation of this simplified case is this web-safe ASCII one: https://github.com/GitHubRulesOK/MyNotes/blob/master/coloursAscii.pdf

This one perhaps makes it easier to see exactly what is happening with a lossless picture in a PDF, if you study just the core objects.

5 0 obj <</Length 7/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 1/ColorSpace/DeviceGray/Filter/ASCIIHexDecode>>
stream
ffff7f>
endstream
endobj
6 0 obj <</Length 46>>
stream
1 0 0 -1 -0 72 cm
72 0 0 -72 0 72 cm
/Img3 Do

endstream
endobj
7 0 obj <</Length 65/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB/Filter/ASCIIHexDecode>>
stream
ff0000 00ff00 0000ff
00ffff ff00ff ffff00
000000 ffffff 000000>
endstream
endobj
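The two ASCIIHexDecode streams above can be decoded with very little code, which makes them a good first test for an extractor. The sketch below is an illustration under assumptions, not a full implementation: `asciiHexDecode` and `unpackMask` are hypothetical names, and `unpackMask` assumes 1 bit per component with each row padded to a whole byte, where a 1 bit means opaque.

```cpp
#include <cctype>
#include <stdexcept>
#include <string>
#include <vector>

// Decodes ASCIIHexDecode data: whitespace is skipped, '>' ends the data,
// and a trailing single digit is treated as if followed by 0.
std::vector<unsigned char> asciiHexDecode(const std::string& text) {
    std::vector<unsigned char> out;
    int high = -1;  // pending high nibble, or -1 if none
    for (char ch : text) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c == '>') break;
        if (std::isspace(c)) continue;
        if (!std::isxdigit(c)) {
            throw std::invalid_argument("invalid character in hex stream");
        }
        const int value = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        if (high < 0) {
            high = value;
        } else {
            out.push_back(static_cast<unsigned char>(high * 16 + value));
            high = -1;
        }
    }
    if (high >= 0) out.push_back(static_cast<unsigned char>(high * 16));
    return out;
}

// Expands a 1-bit-per-pixel mask into one flag per pixel (true = opaque).
// Each row starts on a byte boundary; the most significant bit comes first.
std::vector<bool> unpackMask(const std::vector<unsigned char>& data,
                             int width, int height) {
    if (width <= 0 || height <= 0) {
        throw std::invalid_argument("width and height must be positive");
    }
    const std::size_t stride = (static_cast<std::size_t>(width) + 7) / 8;
    if (data.size() < stride * static_cast<std::size_t>(height)) {
        throw std::invalid_argument("mask data is too short");
    }
    std::vector<bool> opaque;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const unsigned char byte = data[y * stride + x / 8];
            opaque.push_back(((byte >> (7 - x % 8)) & 1) != 0);
        }
    }
    return opaque;
}
```

Decoding `ffff7f>` from object 5 gives the three bytes FF FF 7F, one per row. Only the first pixel of the last row has a 0 bit, so it is the transparent "A" pixel, while the white and black pixels next to it stay opaque.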

Upvotes: 0

mkl

Reputation: 96009

To understand how certain data is stored in a file format, the best approach is usually to read the specification.

In the case at hand you should read the PDF specification ISO 32000, preferably the current ISO 32000-2:2020 but for starters the older ISO 32000-1:2008 should do, too. You can download a free copy of the latter at https://Adobe.com/go/pdfreference

By now, even ISO 32000-2 is available at no cost: https://www.pdfa.org/announcing-no-cost-access-to-iso-32000-2-pdf-2-0/

I assume that by "picture" you mean bitmap images, which the PDF specification calls sampled images. Section 8.9 deals with them.

Upvotes: 0
