Reputation: 1085
I have thousands of PDFs generated from emails containing .png images (I am not the owner of the generator). For some reason, those PDFs are very, very slow to render with the imaging system I am using (I am not the developer of that system and cannot change it).
If I use iTextSharp and implement an IRenderListener to count the images to be rendered, there are thousands per page (99% being only 1 or 2 pixels). But if I count the images in the resources of the PDF, there are only a few (tens at most).
I am counting the images in the resources, per page, with the following code:
var dict = pdfReader.GetPageN(currentPage);
PdfDictionary res = (PdfDictionary)PdfReader.GetPdfObject(dict.Get(PdfName.RESOURCES));
PdfDictionary xobj = (PdfDictionary)PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT));
if (xobj != null)
{
    foreach (PdfName name in xobj.Keys)
    {
        PdfObject obj = xobj.Get(name);
        if (obj.IsIndirect())
        {
            PdfDictionary tg = (PdfDictionary)PdfReader.GetPdfObject(obj);
            PdfName subtype = (PdfName)PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE));
            if (PdfName.IMAGE.Equals(subtype))
            {
                Count++; // one more Image XObject in this page's resources
            }
        }
    }
}
And my IRenderListener looks like this:
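class ImageRenderListener : IRenderListener
{
    public int Count;

    public void RenderImage(iTextSharp.text.pdf.parser.ImageRenderInfo renderInfo)
    {
        PdfImageObject image = renderInfo.GetImage();
        if (image == null) return;
        var refObj = renderInfo.GetRef();
        if (refObj == null)
            Count++; // but why no ref ??
        else
            Count++;
    }

    // Remaining IRenderListener members; text events are not needed here.
    public void BeginTextBlock() { }
    public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo) { }
    public void EndTextBlock() { }
}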
I just started to learn about PDF specification and iTextSharp this evening, to analyze my PDF and understand what could be wrong... if I am correct, I see that many images to be rendered that are not referencing a resource (refObj == null) and that they are .png (image.streamContentType.FileExtension = "png"). So, I think those are the images making the rendering so slow...
For testing purposes, I would like to delete those images from the PDF, but I can't find out how to proceed.
I only found code samples to remove image that are in the resources... but the images I want to delete are not :/
Is there any code sample somewhere to help me ? I did google on "iTextSharp remove object", etc... but there was nothing similar to my case :(
Upvotes: 1
Views: 2548
Reputation: 77528
Let me start with the blunt observation that you have a shitty PDF.
The image you see when opening the PDF in a PDF viewer seems to be composed of several small 1- or 2-pixel images. The drawing operations to show these pixels one by one are suboptimal, no matter which imaging system you use: you are faced with a bad PDF.
In your first snippet, I see that you loop over all of the indirect objects stored in the XObject resources of each page in search of images. You count these images, resulting in a number of Image XObjects stored in the PDF. If you add up all the Count values for all the pages, this number can be higher than the actual number of Image XObjects stored in the PDF, as you don't take into account that some images can be reused on different pages.
You do not count the inline images that are stored in the content streams. I'm biased. In the ISO committees for PDF, I'm on the side of the group of people saying that "inline images are evil" and "inline images should die". So far, we haven't succeeded in getting rid of inline images, but we introduced some substantial limitations that should reduce the (ab)use of inline images in PDFs that conform to ISO-32000-2 (the PDF 2.0 spec that is due in 2016).
You've already discovered that your PDF has inline images. Those are the images where refObj == null. They are not stored as indirect objects; they are stored inline, in the content stream of the page. As you can imagine based on my feelings towards inline images, I consider your PDF to be a bad PDF for this reason (although it does conform to ISO-32000-1).
If you remove Image XObjects by removing the indirect objects containing the image stream objects, you have to be very careful: are you sure you're not corrupting your document? There's a reference to the Image XObject in the content stream of your page. This reference points to an entry in the /XObject dictionary of the page's /Resources. That entry refers to the stream object with the image bytes. If you remove that indirect object without removing the references (e.g. from the content stream), you break your PDF. Some viewers will ignore those errors, but at some point in time some tool (or somebody) is going to complain that your PDF is corrupt.
If you want to remove inline images, you have to parse all the content streams in your PDF: page content streams as well as Form XObject content streams. You have to rewrite all these streams and make sure all inline images are removed. That is: all objects that start with the BI operator (Begin Image) and end with the EI operator (End Image).
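To give an idea of what "rewriting a content stream" involves, here is a minimal sketch in plain C# with no iTextSharp calls. It assumes you already have the decoded bytes of one content stream, and it drops every span from a stand-alone BI token through the next stand-alone EI token. The class name InlineImageStripper and the method Strip are hypothetical names made up for this illustration. The scan is deliberately naive: it would misfire on a "BI" that appears inside a string literal, and on binary image data that happens to contain " EI ", so treat it as a test-only experiment, not a robust parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of iTextSharp.
public static class InlineImageStripper
{
    // PDF white-space characters (NUL, TAB, LF, FF, CR, SPACE).
    private static bool IsWhite(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // True when the two-letter operator sits at pos as a stand-alone token.
    private static bool IsOperatorAt(byte[] data, int pos, string op)
    {
        if (pos < 0 || pos + 2 > data.Length) return false;
        if (data[pos] != (byte)op[0] || data[pos + 1] != (byte)op[1]) return false;
        bool startsToken = pos == 0 || IsWhite(data[pos - 1]);
        bool endsToken = pos + 2 == data.Length || IsWhite(data[pos + 2]);
        return startsToken && endsToken;
    }

    // Index of the next stand-alone occurrence of op at or after from, or -1.
    private static int FindOperator(byte[] data, int from, string op)
    {
        for (int i = from; i + 2 <= data.Length; i++)
            if (IsOperatorAt(data, i, op)) return i;
        return -1;
    }

    // Returns a copy of content without BI ... ID ... EI spans;
    // removed receives the number of spans that were dropped.
    public static byte[] Strip(byte[] content, out int removed)
    {
        removed = 0;
        var output = new List<byte>(content.Length);
        int pos = 0;
        while (pos < content.Length)
        {
            int bi = FindOperator(content, pos, "BI");
            int id = bi < 0 ? -1 : FindOperator(content, bi + 2, "ID");
            // Skip "ID" plus the single white-space byte that precedes the data.
            int ei = id < 0 ? -1 : FindOperator(content, id + 3, "EI");
            if (ei < 0)
            {
                // No further complete inline image: copy the rest unchanged.
                for (int i = pos; i < content.Length; i++) output.Add(content[i]);
                break;
            }
            // Keep everything before BI, then jump past EI.
            for (int i = pos; i < bi; i++) output.Add(content[i]);
            removed++;
            pos = ei + 2;
        }
        return output.ToArray();
    }
}
```

With iTextSharp, the input could be the array returned by PdfReader.GetPageContent for a page, and the result could be written back with SetPageContent before saving through a PdfStamper. That only covers page content streams: Form XObject streams would need the same treatment, and every output file should be checked in more than one viewer.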
That's a task for a PDF specialist who knows both iTextSharp and ISO-32000-1 inside-out. The solution to your problem probably doesn't fit into an answering window on StackOverflow.
I'm the original author of iText. From a certain point of view, iText is like a sharp knife. A sharp knife is a very good tool that can be used for many good things. However, you can also seriously cut your fingers when you're not using the knife in a correct way. I hope you'll be careful and that you're not going to create a whole series of damaged PDF files.
For instance: you assume that some of the images in the PDF are PNGs because iText suggests storing them as PNGs. However, PNG is not supported by ISO-32000-1, so your assumption that your PDF contains PNGs is wrong. I honestly worry when I see questions like yours.
Upvotes: 3