XTL

Reputation: 1452

Extract Images from Pdf via iTextSharp 4.1.6.0

Hello all (and you too, Bruno :) ).
I'm using iTextSharp 4.1.6.0, which was ported to Xamarin.Android.
I need to extract images from a PDF.
I found many examples, but they don't seem to work in my case because some classes (such as
ImageCodecInfo, ImageRenderInfo, System.Drawing.Imaging.EncoderParameters, PdfImageObject, etc.) don't exist.

But one example looks fine; here it is:

void ExtractJpeg(string file)
{
    var dir1 = Path.GetDirectoryName(file);
    var fn = Path.GetFileNameWithoutExtension(file);
    var dir2 = Path.Combine(dir1, fn);
    if (!Directory.Exists(dir2)) Directory.CreateDirectory(dir2);

    var pdf = new PdfReader(file);
    int n = pdf.NumberOfPages;
    for (int i = 1; i <= n; i++)
    {
        var pg = pdf.GetPageN(i);
        var res = PdfReader.GetPdfObject(pg.Get(PdfName.RESOURCES)) as PdfDictionary;
        var xobj = PdfReader.GetPdfObject(res.Get(PdfName.XOBJECT)) as PdfDictionary;
        if (xobj == null) continue;

        var keys = xobj.Keys;
        if (keys.Count == 0) continue;

        var obj = xobj.Get(keys.ElementAt(0));
        if (!obj.IsIndirect()) continue;

        var tg = PdfReader.GetPdfObject(obj) as PdfDictionary;
        var type = PdfReader.GetPdfObject(tg.Get(PdfName.SUBTYPE)) as PdfName;
        if (!PdfName.IMAGE.Equals(type)) continue;

        int XrefIndex = (obj as PRIndirectReference).Number;
        var pdfStream = pdf.GetPdfObject(XrefIndex) as PRStream;
        var data = PdfReader.GetStreamBytesRaw(pdfStream);
        var jpeg = Path.Combine(dir2, string.Format("{0:0000}.jpg", i));
        File.WriteAllBytes(jpeg, data);
    }
}    

The problem is in this line:

var obj = xobj.Get(keys.ElementAt(0));  

Error log:

The type arguments for method `System.Linq.ParallelEnumerable.ElementAt(this System.Linq.ParallelQuery, int)' cannot be inferred from the usage. Try specifying the type arguments explicitly

I have no idea how to work around this. Can someone explain it to me?

Also, I would like to know whether there is another way to extract images from a PDF.
Thanks!!

Upvotes: 1

Views: 2319

Answers (1)

Chris Haas

Reputation: 55417

First, the obligatory speech about upgrading from old, obsolete and no longer officially supported software:

Please upgrade to the most recent version of iTextSharp. I know that you're going to say that you can't use iText's new license but please read their sales FAQ, specifically the "Why shouldn't I use..." section which addresses 4.1.6. Please remember that in most countries, accepting the license actually enters you into a legal contract so I would also have someone with legal experience read that, too. Since you say that you are using Xamarin I'm thinking that you are submitting this to a store, too, so this is even more important because the problems can multiply very fast.

Also, there's a new version of PDF coming out pretty soon and you'll probably want to be on track to support that, too.

Second, your code makes a giant and incorrect assumption that all images in a PDF are JPEGs. See this post and this post for a bit of a discussion on it. Maybe your PDFs are all JPEGs so this works for you but there's a good chance that this will break "tomorrow".
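As a rough guard against writing non-JPEG data to a .jpg file, you could check the raw bytes before saving them. This is only a heuristic sketch, not anything iTextSharp provides: JPEG data starts with the bytes FF D8 followed by another FF marker byte. The names JpegSniffer and LooksLikeJpeg are made up for illustration.

```csharp
using System;

static class JpegSniffer
{
    // Heuristic only: JPEG data begins with the SOI marker (FF D8),
    // followed by another marker whose first byte is FF.
    public static bool LooksLikeJpeg(byte[] data)
    {
        return data != null
            && data.Length >= 3
            && data[0] == 0xFF
            && data[1] == 0xD8
            && data[2] == 0xFF;
    }

    static void Main()
    {
        byte[] jpegHeader = { 0xFF, 0xD8, 0xFF, 0xE0 };
        byte[] pngHeader = { 0x89, 0x50, 0x4E, 0x47 };

        Console.WriteLine(LooksLikeJpeg(jpegHeader)); // True
        Console.WriteLine(LooksLikeJpeg(pngHeader));  // False
    }
}
```

In the extraction loop, you would call something like LooksLikeJpeg(data) before File.WriteAllBytes and skip (or handle differently) anything that fails the check.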

Third, I can't get ElementAt to work with an ICollection. I don't know if I'm missing an extension or a using directive somewhere, but it appears that you copied the code from a five-year-old post here that came from a six-year-old post here. I'm also not sure why the "first" element is needed anyway; that's weird. The solution is to loop over the keys instead of explicitly grabbing just one. Instead of:

var obj = xobj.Get(keys.ElementAt(0));
//...
File.WriteAllBytes(jpeg, data);

Loop over each key:

foreach (PdfName k in keys) {
    var obj = xobj.Get(k);
    //...
    File.WriteAllBytes(jpeg, data);
}

This small change will make us all cry but it should make extraction of images at least work.
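On the compiler error itself: the LINQ ElementAt extension is defined for the generic IEnumerable<T>, so when it is called on a non-generic ICollection the compiler has no T to infer. If you really do want a single element, Cast<T>() first. The sketch below assumes the dictionary's Keys property is a non-generic ICollection, and it uses a plain Hashtable as a stand-in so that it runs without iTextSharp; ElementAtDemo is just an illustrative name.

```csharp
using System;
using System.Collections;
using System.Linq;

static class ElementAtDemo
{
    static void Main()
    {
        // Hashtable.Keys is a non-generic ICollection (stand-in for xobj.Keys).
        var table = new Hashtable();
        table["Im0"] = "image stream";
        ICollection keys = table.Keys;

        // keys.ElementAt(0) does not compile here: no T can be inferred
        // from a non-generic IEnumerable.
        // Cast<T>() yields an IEnumerable<T>, so the LINQ operators apply.
        string first = keys.Cast<string>().ElementAt(0);

        Console.WriteLine(first); // Im0
    }
}
```

With iTextSharp the equivalent would presumably be keys.Cast<PdfName>().ElementAt(0), but looping over every key is still the better fix because a page can hold more than one image.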

Upvotes: 3
