Kurt Pfeifle
Kurt Pfeifle

Reputation: 90193

How can I remove all images from a PDF?

I want to remove all images from a PDF file.

The page layouts should not change. All images should be replaced by empty space.

Upvotes: 15

Views: 8363

Answers (3)

K J
K J

Reputation: 11739

A comment was made that using recent GhostScript this method can be slow.

Especially with the Adobe ISO Standard file as a test.

The key point about GhostScript method is it has to generally write a totally different output file and "filter out" the unwanted image types. As a consequence the speed can become much slower and slower as the output file gets bigger.

Other applications may only seek images and rapidly remove their entries without parsing any other slowing down contents.

An example is Coherent cpdf which for the same file is about 6 or more times faster, depending on size.

cpdf -i pdf_reference_1-7.pdf -draft -o pdfref-no-images.pdf
or cpdf -draft in.pdf -o out.pdf

The results look similar but beware not ALL images in either GhostScript or cpdf method will be removed. Which was described by Kurt Pfeifle in his accepted answer. This is due to the many ways an image like object can be defined. As such those objects may need manual selection and redaction.

enter image description here

Upvotes: 0

Kurt Pfeifle
Kurt Pfeifle

Reputation: 90193

Meanwhile the latest Ghostscript releases have a much nicer and easier to use method of removing all images from a PDF. The parameter to add to the command line is -dFILTERIMAGE

 gs -o noimages.pdf -sDEVICE=pdfwrite -dFILTERIMAGE input.pdf

Even better, you can also remove all text or all vector drawing elements from a PDF by specifying -dFILTERTEXT or -dFILTERVECTOR.

Of course, you can also combine any combination of these -dFILTER* parameters you want in order to achieve a required result. (Combining all three will of course result in "empty" pages.)

Here is the screenshot from an example PDF page which contains all 3 types of content mentioned above:


Screenshot of original PDF page containing "image", "vector" and "text" elements.
Screenshot of original PDF page containing "image", "vector" and "text" elements.


Running the following 6 commands will create all 6 possible variations of remaining contents:

 gs -o noIMG.pdf   -sDEVICE=pdfwrite -dFILTERIMAGE                input.pdf
 gs -o noTXT.pdf   -sDEVICE=pdfwrite -dFILTERTEXT                 input.pdf
 gs -o noVCT.pdf   -sDEVICE=pdfwrite -dFILTERVECTOR               input.pdf

 gs -o onlyTXT.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE input.pdf 
 gs -o onlyIMG.pdf -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERTEXT  input.pdf
 gs -o onlyVCT.pdf -sDEVICE=pdfwrite -dFILTERIMAGE  -dFILTERTEXT  input.pdf

The following image illustrates the results:


Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.
Top row, from left: all "text" removed; all "images" removed; all "vectors" removed. Bottom row, from left: only "text" kept; only "images" kept; only "vectors" kept.


Upvotes: 25

Kurt Pfeifle
Kurt Pfeifle

Reputation: 90193

I'm putting up the answer myself, but the actual code is by courtesy of Chris Liddell, Ghostscript developer.

I used his original PostScript code and stripped off its other functions. Only the function which removes raster images remains. Other graphical page objects -- text sections, patterns and vector objects -- should remain untouched.

Copy the following code and save it as remove-images.ps:

%!PS

% Run as:
%
%      gs ..... -dFILTERIMAGE -dDELAYBIND -dWRITESYSTEMDICT \
%                 ..... remove-images.ps <your-input-file>
%
% derived from Chris Liddell's original 'filter-obs.ps' script
% Adapted by @pdfkungfoo (on Twitter)

currentglobal true setglobal

32 dict begin

/debugprint     { systemdict /DUMPDEBUG .knownget { {print flush} if} 
                {pop} ifelse } bind def

/pushnulldevice {
  systemdict exch .knownget not
  {
    //false
  } if

  {
    gsave
    matrix currentmatrix
    nulldevice
    setmatrix
  } if
} bind def

/popnulldevice {
  systemdict exch .knownget not
  {
    //false
  } if
  {
    % this is hacky - some operators clear the current point
    % i.e.
    { currentpoint } stopped
    { grestore }
    { grestore moveto} ifelse
  } if
} bind def

/sgd {systemdict exch get def} bind def

systemdict begin

/_image /image sgd
/_imagemask /imagemask sgd
/_colorimage /colorimage sgd

/image {
   (\nIMAGE\n) //debugprint exec /FILTERIMAGE //pushnulldevice exec
  _image
  /FILTERIMAGE //popnulldevice exec
} bind def

/imagemask
{
  (\nIMAGEMASK\n) //debugprint exec
  /FILTERIMAGE //pushnulldevice exec
  _imagemask
  /FILTERIMAGE //popnulldevice exec
} bind def

/colorimage
{
  (\nCOLORIMAGE\n) //debugprint exec
  /FILTERIMAGE //pushnulldevice exec
  _colorimage
  /FILTERIMAGE //popnulldevice exec
} bind def

end
end

.bindnow

setglobal

Now run this command:

gs -o no-more-images-in-sample.pdf \
   -sDEVICE=pdfwrite               \
   -dFILTERIMAGE                   \
   -dDELAYBIND                     \
   -dWRITESYSTEMDICT               \
    remove-images.ps               \
    sample.pdf

I tested the code with the official PDF specification, and it worked. The following two screenshots show page 750 of input and output PDFs:

If you wonder why something that looks like an image is still on the output page: it is not really a raster image, but a 'pattern' in the original file, and therefor it is not removed.

Upvotes: 10

Related Questions