Valery Letroye
Valery Letroye

Reputation: 1085

Remove PdfImageOject from a PDF

I have 1000th of PDF generated from emails containing .png (I am not owner of the generator). For some reasons, those PDF are very very slow to render with the Imaging system I am using (I am not the developer of that system and may not change it).

If I use iTextSharp and implement a IRenderListener to count the Images to be rendered, there are thousands per page (99% being 1 or 2 pixels only). But if I count the Images in the resources of the PDF, there are only a few (~tens).

I am counting the images in the resources, per page, with the code here after

var dict = pdfReader.GetPageN(currentPage)
PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES));
PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));

if (xobj != null)
{
 foreach (PdfName name in xobj.Keys)
 {
   PdfObject obj = xobj.Get(name);
   if ((obj.IsIndirect()))
   {
     PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
     PdfName subtype = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
     if (PdfName.IMAGE.Equals(subtype))
     {
       Count++

And my IRenderListener looks like this:

class ImageRenderListener : IRenderListener
{
    public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;
        var refObj = renderInfo.GetRef();
        if (refObj == null)
            Count++; // but why no ref ??
        else
            Count++;
    }

I just started to learn about PDF specification and iTextSharp this evening, to analyze my PDF and understand what could be wrong... if I am correct, I see that many images to be rendered that are not referencing a resource (refObj == null) and that they are .png (image.streamContentType.FileExtension = "png"). So, I think those are the images making the rendering so slow...

For testing purpose, I would like to delete those images from the PDF but don't find how to proceed.

I only found code samples to remove image that are in the resources... but the images I want to delete are not :/

Is there any code sample somewhere to help me ? I did google on "iTextSharp remove object", etc... but there was nothing similar to my case :(

Upvotes: 1

Views: 2548

Answers (1)

Bruno Lowagie
Bruno Lowagie

Reputation: 77528

Let me start with the blunt observation that you have a shitty PDF.

The image you see when opening the PDF in a PDF viewer seems to be composed of several small 1- or 2-pixel images. The drawing operations to show these pixels one by one is suboptimal, no matter which imaging system you use: you are faced with a bad PDF.

In your first snippet, I see that you loop over all of the indirect objects stored in the the XObject resources of each page in search of images. You count these images, resulting in a number of Image XObjects stored in the PDF. If you add up all the Count values for all the pages, this number can be higher than the actual number of Image XObject stored in the PDF as you don't take into account that some images can be reused on different pages.

You do not count the inline images that are stored in the content streams. I'm biased. In the ISO committees for PDF, I'm on the side of the group of people saying that "inline images are evil" and "inline images should die". For now, we didn't succeed in getting rid of inline images, but we introduced some substantial limitations that should reduce the (ab)use of inline images in PDF that conform to ISO-32000-2 (the PDF 2.0 spec that is due in 2016).

You've already discovered that your PDF has inline images. Those are the images where refObj == null. They are not stored as indirect objects; they are stored inline, in the content stream of the page. As you can imagine based on my feelings towards inline images, I consider your PDF being a bad PDF for this reason (although it does conform to ISO-32000-1).

  1. The presence of inline images is a first explanation why you have a different image count: when you loop over the indirect objects you only find part of the images. When you parse the document for images, you also find the inline images.
  2. A second explanation could be the fact that the Image XObject are used more than once. That's the whole point of not using inline images. For instance: if you have an image that represents a logo that needs to be repeated on every page, one could use inline images. That would be a bad idea: the same image bytes would be present in the PDF as many times as there are pages. One should use an Image XObject. In this case, the image bytes of the logo are stored only once in an indirect object. There's a reference to this object from every page, so that the image bytes are stored in the document only once. In a 10-page document, you can see 10 identical images on 10 pages, but when looking inside the document, you'll find only one image that is referenced from every page.

If you remove Image XObjects by removing the indirect objects containing the image stream objects, you have to be very careful: are you sure you're not corrupting your document? Because there's a reference to the Image XObject in the content stream of your page. This reference points to an entry in the /XObjects entry of the page's /Resources. This /XObject references to the stream object with the image bytes. If you remove that indirect object without removing the references (e.g. from the content stream), you break your PDF. Some viewers will ignore those errors, but at some point in time some tool (or some body) is going to complain that your PDF is corrupt.

If you want to remove inline images, you have to parse all the content streams in your PDF: page content streams as well as Form XObject content streams. You have to rewrite all these streams and make sure all inline images are removed. That is: all objects that that start with the BI operator (Begin Image) and end with the EI operator (End Image).

That's a task for a PDF specialist who knows both iTextSharp and ISO-32000-1 inside-out. The solution to your problem probably doesn't fit into an answering window on StackOverflow.

I'm the original author of iText. From a certain point of view, iText is like a sharp knife. A sharp knife is a very good tool that can be used for many good things. However, you can also seriously cut your fingers when you're not using the knife in a correct way. I hope you'll be careful and that you're not going to create a whole series of damaged PDF files.

For instance: you assume that some of the files in the PDF are PNGs because iText suggests to store them as PNGs. However: PNG is not supported by ISO-32000-1, so your assumption that your PDF contains PNGs is wrong. I honestly worry when I see questions like yours.

Upvotes: 3

Related Questions