Reputation: 90193
I want to remove all images from a PDF file.
The page layouts should not change. All images should be replaced by empty space.
Upvotes: 15
Views: 8363
Reputation: 11739
A comment noted that with recent Ghostscript versions this method can be slow, especially with the Adobe ISO standard file as a test.
The key point about the Ghostscript method is that it generally has to write a completely new output file while filtering out the unwanted image types. As a consequence, it gets slower and slower as the output file gets bigger.
Other applications may only seek out the images and quickly remove their entries, without parsing any of the other content that would slow them down.
An example is Coherent cpdf, which for the same file is about 6 or more times faster, depending on size.
cpdf -i pdf_reference_1-7.pdf -draft -o pdfref-no-images.pdf
or
cpdf -draft in.pdf -o out.pdf
The results look similar, but beware: neither the Ghostscript nor the cpdf method will remove ALL images, as Kurt Pfeifle described in his accepted answer. This is due to the many ways an image-like object can be defined. Such objects may need manual selection and redaction.
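One way to see how many raster images survive is to count them before and after. This is only a rough sketch: it assumes the `pdfimages` tool from Poppler is installed (it is not mentioned elsewhere in this thread), and `in.pdf` / `out.pdf` are placeholder names. The `count_rows` and `report` helpers are hypothetical names introduced here.

```shell
#!/bin/sh
# Sketch: compare how many image entries pdfimages reports before and after.
# Assumptions: pdfimages (Poppler) is installed; file names are placeholders.

# "pdfimages -list" prints two header lines, then one row per image,
# so the image count is the number of lines after the header.
count_rows() {
    tail -n +3 | wc -l | tr -d ' '
}

# $1 = PDF file to inspect; skips quietly if the tool or file is missing.
report() {
    if command -v pdfimages >/dev/null 2>&1 && [ -f "$1" ]; then
        echo "$1: $(pdfimages -list "$1" | count_rows) image(s) listed"
    else
        echo "$1: skipped (pdfimages or file missing)"
    fi
}

report in.pdf
report out.pdf
```

A count of 0 for the output does not prove that nothing image-like remains, since objects such as patterns are not raster images and would not show up in this listing.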
Upvotes: 0
Reputation: 90193
Meanwhile, the latest Ghostscript releases offer a much nicer and easier-to-use method of removing all images from a PDF. The parameter to add to the command line is -dFILTERIMAGE:
gs -o noimages.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
Even better, you can also remove all text or all vector drawing elements from a PDF by specifying -dFILTERTEXT or -dFILTERVECTOR.
Of course, you can also combine any of these -dFILTER* parameters in order to achieve a required result. (Combining all three will of course result in "empty" pages.)
Here is the screenshot from an example PDF page which contains all 3 types of content mentioned above:
Screenshot of original PDF page containing "image", "vector" and "text" elements.
Running the following 6 commands will create all 6 possible variations of remaining contents:
gs -o noIMG.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf
gs -o noTXT.pdf -sDEVICE=pdfwrite -dFILTERTEXT input.pdf
gs -o noVCT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR input.pdf
gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf
gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT input.pdf
gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE -dFILTERTEXT input.pdf
The following image illustrates the results:
Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
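The six commands differ only in the output name and the switches, so they can be generated from a small table. A minimal sketch, assuming the same input.pdf and output names as above; the `RUN` variable and the `build_cmd` helper are hypothetical names introduced here, and by default the script only prints the commands instead of running gs.

```shell
#!/bin/sh
# Sketch only: builds the six command lines from the answer above.
# Assumption: the input file is called input.pdf. RUN is a hypothetical
# switch introduced here (RUN=1 executes gs, anything else just prints).
input="input.pdf"

# Print one gs command: $1 is the output name, the rest are -dFILTER* switches.
build_cmd() {
    out="$1"
    shift
    echo "gs -o $out -sDEVICE=pdfwrite $* $input"
}

# One variant per line: output name, a colon, then the switches that produce it.
variants="noIMG.pdf:-dFILTERIMAGE
noTXT.pdf:-dFILTERTEXT
noVCT.pdf:-dFILTERVECTOR
onlyTXT.pdf:-dFILTERVECTOR -dFILTERIMAGE
onlyIMG.pdf:-dFILTERVECTOR -dFILTERTEXT
onlyVCT.pdf:-dFILTERIMAGE -dFILTERTEXT"

echo "$variants" | while IFS=: read -r out flags; do
    # $flags is left unquoted on purpose so each switch becomes its own argument.
    cmd=$(build_cmd "$out" $flags)
    echo "$cmd"
    if [ "${RUN:-0}" = "1" ]; then
        $cmd
    fi
done
```

Run it once without RUN to check the printed commands, then again as `RUN=1 sh script.sh` (script name is a placeholder) to actually produce the six files.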
Upvotes: 25
Reputation: 90193
I'm putting up the answer myself, but the actual code is courtesy of Chris Liddell, Ghostscript developer.
I used his original PostScript code and stripped off its other functions. Only the function which removes raster images remains. Other graphical page objects -- text sections, patterns and vector objects -- should remain untouched.
Copy the following code and save it as remove-images.ps:
%!PS
% Run as:
%
% gs ..... -dFILTERIMAGE -dDELAYBIND -dWRITESYSTEMDICT \
% ..... remove-images.ps <your-input-file>
%
% derived from Chris Liddell's original 'filter-obs.ps' script
% Adapted by @pdfkungfoo (on Twitter)
currentglobal true setglobal
32 dict begin
/debugprint { systemdict /DUMPDEBUG .knownget { {print flush} if}
{pop} ifelse } bind def
/pushnulldevice {
systemdict exch .knownget not
{
//false
} if
{
gsave
matrix currentmatrix
nulldevice
setmatrix
} if
} bind def
/popnulldevice {
systemdict exch .knownget not
{
//false
} if
{
% this is hacky - some operators clear the current point
% i.e.
{ currentpoint } stopped
{ grestore }
{ grestore moveto} ifelse
} if
} bind def
/sgd {systemdict exch get def} bind def
systemdict begin
/_image /image sgd
/_imagemask /imagemask sgd
/_colorimage /colorimage sgd
/image {
(\nIMAGE\n) //debugprint exec /FILTERIMAGE //pushnulldevice exec
_image
/FILTERIMAGE //popnulldevice exec
} bind def
/imagemask
{
(\nIMAGEMASK\n) //debugprint exec
/FILTERIMAGE //pushnulldevice exec
_imagemask
/FILTERIMAGE //popnulldevice exec
} bind def
/colorimage
{
(\nCOLORIMAGE\n) //debugprint exec
/FILTERIMAGE //pushnulldevice exec
_colorimage
/FILTERIMAGE //popnulldevice exec
} bind def
end
end
.bindnow
setglobal
Now run this command:
gs -o no-more-images-in-sample.pdf \
-sDEVICE=pdfwrite \
-dFILTERIMAGE \
-dDELAYBIND \
-dWRITESYSTEMDICT \
remove-images.ps \
sample.pdf
I tested the code with the official PDF specification, and it worked. The following two screenshots show page 750 of input and output PDFs:
If you wonder why something that looks like an image is still on the output page: it is not really a raster image, but a 'pattern' in the original file, and therefore it is not removed.
Upvotes: 10