Reputation: 11
I want to extract pictures from PDF files using C++, but I don't understand the picture format in PDF files. Can someone help me?
I looked at the content of a PDF file by opening it in Notepad. I tried to unzip the content, but failed to extract any pictures.
Upvotes: 0
Views: 622
Reputation: 11857
To show just one of the many dozens of ways images can be stored in a PDF, here is the smallest working example I can easily write.
It has 9 basic colours for comparison: RGB, CMY and AWK (red, green, blue; cyan, magenta, yellow; alpha, white, black).
If your editor is as good as MS Notepad, it should work when saved as colours.pdf. However, pasted on the web it will likely be corrupted, so download is here. Colours.pdf should work in most viewers, just not shown as a GitHub page (but see later).
%PDF-1.7
%µ¶
1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj
2 0 obj <</Type/Pages/Count 1/Kids[3 0 R]>> endobj
3 0 obj <</Type/Page/MediaBox[0 0 72 72]/Rotate 0/Resources 4 0 R/Contents 6 0 R/Parent 2 0 R>> endobj
4 0 obj <</XObject<</Img3 7 0 R>>>> endobj
5 0 obj <</Length 12/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 1/ColorSpace/DeviceGray/Filter/FlateDecode>>
stream
xœûÿ¿þ? ú}
endstream
endobj
6 0 obj <</Length 40/Filter/FlateDecode>>
stream
xœ3T0 B]C]s#…ä\. Ó!}ÏÜtc—|. È >
endstream
endobj
7 0 obj
<</Length 22/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB/Filter/FlateDecode>>
stream
xœûÏÀÀðŒÿÿ‡`L §sõ
endstream
endobj
xref
0 8
0000000000 00001 f
0000000016 00000 n
0000000062 00000 n
0000000114 00000 n
0000000218 00000 n
0000000262 00000 n
0000000427 00000 n
0000000535 00000 n
trailer
<</Size 8/Root 1 0 R>>
startxref
721
%%EOF
So points to note are:
Filter/CCITTFaxDecode would make the streams complex to compare, so it was altered to the same filter as the RGB colours; all streams are therefore deflated with Filter/FlateDecode.
So to extract the two images (the RGB image and its SMask) as one, you need to write a library of functions covering every permutation you may encounter. However, it is far simpler to use a small (10-50 MB) single-executable application that already handles most of those permutations, honed through much trial and error.
Here is the pdf decoded to see exactly how those raw colours work.
https://github.com/GitHubRulesOK/MyNotes/blob/master/colours_decoded.pdf
BE AWARE: for simple illustration, the colours are no longer true (you can see some have been neutered as text).
%PDF-1.7
%µ¶
1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj
2 0 obj <</Type/Pages/Count 1/Kids [ 3 0 R ]>> endobj
3 0 obj <</Type/Page/MediaBox [ 0 0 72 72 ]/Rotate 0/Resources 4 0 R/Contents 6 0 R/Parent 2 0 R>> endobj
4 0 obj <</XObject<</Img3 7 0 R>>>> endobj
5 0 obj <</Length 4/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 1/ColorSpace/DeviceGray>>
stream
ÿÿÿ
endstream
endobj
6 0 obj <</Length 46>>
stream
1 0 0 -1 -0 72 cm
72 0 0 -72 0 72 cm
/Img3 Do
endstream
endobj
7 0 obj<</Length 27/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB>>
stream
ÿ ÿ ÿ ÿÿÿ ÿÿÿ ÿÿÿ
endstream
endobj
xref
0 8
0000000000 00002 f
0000000015 00000 n
0000000060 00000 n
0000000114 00000 n
0000000220 00000 n
0000000263 00000 n
0000000399 00000 n
0000000493 00000 n
trailer
<</Size 8/Root 1 0 R>>
startxref
664
%%EOF
So the colour block is ÿ ÿ ÿ ÿÿÿ ÿÿÿ ÿÿÿ
where each of those spaces should be a null (byte 0x00). Reading three bytes per pixel, with · standing for a null:
ÿ·· =R
·ÿ· =G
··ÿ =B
·ÿÿ =C
ÿ·ÿ =M
ÿÿ· =Y
··· =A
ÿÿÿ =W
··· =K
Also note that in this odd case the decompressed data is much smaller than the compressed data (except for the RGB image).
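To make the "two images as one" idea concrete, here is a minimal C++ sketch of merging the decoded RGB samples with the decoded 1-bit SMask into RGBA pixels. The function names `maskBit` and `mergeToRgba` are hypothetical, and the sketch assumes both streams are already decompressed, that the mask has the same width and height as the image, and that the default decoding applies (mask bit 1 = opaque, 0 = transparent). It is only one of the many permutations, not a general extractor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read one 1-bit sample from packed rows. Each row starts on a byte
// boundary and the most significant bit holds the leftmost pixel.
inline bool maskBit(const std::vector<std::uint8_t>& mask,
                    std::size_t width, std::size_t x, std::size_t y) {
    const std::size_t stride = (width + 7) / 8;  // bytes per mask row
    const std::uint8_t byte = mask[y * stride + x / 8];
    return ((byte >> (7 - x % 8)) & 1u) != 0;
}

// Combine decoded 8-bit DeviceRGB samples with a decoded 1-bit DeviceGray
// soft mask into RGBA, 4 bytes per pixel.
// Assumption: mask bit 1 means opaque, 0 means transparent.
std::vector<std::uint8_t> mergeToRgba(const std::vector<std::uint8_t>& rgb,
                                      const std::vector<std::uint8_t>& mask,
                                      std::size_t width, std::size_t height) {
    assert(rgb.size() >= width * height * 3);
    assert(mask.size() >= ((width + 7) / 8) * height);

    std::vector<std::uint8_t> rgba;
    rgba.reserve(width * height * 4);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t i = (y * width + x) * 3;  // start of this pixel
            const std::uint8_t alpha = maskBit(mask, width, x, y) ? 0xff : 0x00;
            rgba.push_back(rgb[i]);      // red
            rgba.push_back(rgb[i + 1]);  // green
            rgba.push_back(rgb[i + 2]);  // blue
            rgba.push_back(alpha);
        }
    }
    return rgba;
}
```

Fed with the 27 RGB bytes listed above and the mask bytes ff ff 7f (taken from the ASCII variant of this file), the only transparent pixel is the first one of the bottom row, which is the "A" entry.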
A third variation of this simplified case is this web-safe ASCII one: https://github.com/GitHubRulesOK/MyNotes/blob/master/coloursAscii.pdf
This one perhaps makes it easier to see exactly what is happening with a lossless picture in a PDF, if you study just the core objects.
5 0 obj <</Length 7/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 1/ColorSpace/DeviceGray/Filter/ASCIIHexDecode>>
stream
ffff7f>
endstream
endobj
6 0 obj <</Length 46>>
stream
1 0 0 -1 -0 72 cm
72 0 0 -72 0 72 cm
/Img3 Do
endstream
endobj
7 0 obj <</Length 65/Type/XObject/Subtype/Image/Width 3/Height 3/BitsPerComponent 8/SMask 5 0 R/ColorSpace/DeviceRGB/Filter/ASCIIHexDecode>>
stream
ff0000 00ff00 0000ff
00ffff ff00ff ffff00
000000 ffffff 000000>
endstream
endobj
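For this ASCII variant the decoding step is simple enough to sketch in a few lines of C++. The helper names `hexValue` and `decodeAsciiHex` are hypothetical. The sketch follows the ASCIIHexDecode rules as I understand them: whitespace is ignored, `>` marks the end of data, and a trailing odd digit is treated as if followed by 0. Silently skipping any other character is my own simplification, a strict reader would report an error instead.

```cpp
#include <cctype>
#include <cstdint>
#include <string>
#include <vector>

// Value of one hex digit, or -1 if the character is not a hex digit.
static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decode the body of an /ASCIIHexDecode stream into raw bytes.
std::vector<std::uint8_t> decodeAsciiHex(const std::string& text) {
    std::vector<std::uint8_t> out;
    int high = -1;  // pending high nibble, -1 when there is none
    for (char c : text) {
        if (c == '>') break;  // end-of-data marker
        if (std::isspace(static_cast<unsigned char>(c))) continue;
        const int v = hexValue(c);
        if (v < 0) continue;  // assumption: skip unexpected characters
        if (high < 0) {
            high = v;
        } else {
            out.push_back(static_cast<std::uint8_t>((high << 4) | v));
            high = -1;
        }
    }
    // An odd number of digits: the last digit is padded with a zero nibble.
    if (high >= 0) out.push_back(static_cast<std::uint8_t>(high << 4));
    return out;
}
```

Applied to the stream of object 7 this gives 27 bytes (3 x 3 pixels, 3 bytes each), and applied to object 5 it gives the 3 mask bytes ff ff 7f.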
Upvotes: 0
Reputation: 96009
To understand how in some file format certain data is stored, the best approach usually is to read the specification.
In the case at hand you should read the PDF specification ISO 32000, preferably the current ISO 32000-2:2020 but for starters the older ISO 32000-1:2008 should do, too. You can download a free copy of the latter at https://Adobe.com/go/pdfreference
In the meantime, even ISO 32000-2 has become available for free: https://www.pdfa.org/announcing-no-cost-access-to-iso-32000-2-pdf-2-0/
I assume by "picture" you mean bitmap images which the PDF specification calls sampled images. Section 8.9 deals with them.
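As a small illustration of how a sampled image is laid out once its filters have been undone, the following C++ sketch computes the expected size of the decoded sample data from the image dictionary entries. The function names `rowBytes` and `imageBytes` are hypothetical. It assumes the rule that each row of samples starts on a byte boundary, and that the number of colour components is known from the colour space (1 for DeviceGray, 3 for DeviceRGB, 4 for DeviceCMYK); other colour spaces need more work.

```cpp
#include <cstddef>

// Bytes in one row of samples; a row is padded up to a whole byte.
constexpr std::size_t rowBytes(std::size_t width, std::size_t components,
                               std::size_t bitsPerComponent) {
    return (width * components * bitsPerComponent + 7) / 8;
}

// Total bytes of decoded sample data for the whole image.
constexpr std::size_t imageBytes(std::size_t width, std::size_t height,
                                 std::size_t components,
                                 std::size_t bitsPerComponent) {
    return rowBytes(width, components, bitsPerComponent) * height;
}

// 3x3 DeviceRGB at 8 bits per component: 9 bytes per row, 27 in total.
static_assert(imageBytes(3, 3, 3, 8) == 27, "3x3 RGB image");
// 3x3 DeviceGray at 1 bit per component: 1 padded byte per row, 3 in total.
static_assert(imageBytes(3, 3, 1, 1) == 3, "3x3 1-bit mask");
```

Comparing this value with the length of the data you get after decoding is a cheap sanity check before interpreting the bytes as pixels.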
Upvotes: 0