Reputation: 11492

How do I extract attachments from a pdf file?

I have a large number of PDF documents with XML files attached to them. I would like to extract those attached XML files and read them. How can I do this programmatically using .NET?

Upvotes: 10

Answers (5)

Mark Storer

Reputation: 15868

iTextSharp is also quite capable of extracting attachments... Though you might have to use the low level objects to do so.

There are two ways to embed files in a PDF:

In a file attachment annotation (placed on a page).
At the document level, in the "EmbeddedFiles" name tree.

Once you have a file specification dictionary from either source, the file itself will be a stream within the dictionary labeled "EF" (embedded file).

So to list all the files at the document level, one would write code (in Java) as such:

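// Imports assume iText 5.x; older releases use the com.lowagie.text.pdf package instead.
import com.itextpdf.text.pdf.*;
import java.util.HashMap;
import java.util.Map;

Map<String, byte[]> files = new HashMap<String, byte[]>();

PdfReader reader = new PdfReader(pdfPath);
PdfDictionary root = reader.getCatalog();
PdfDictionary names = root.getAsDict(PdfName.NAMES); // null if the document has no name dictionary
PdfDictionary embeddedFilesDict = (names == null) ? null : names.getAsDict(PdfName.EMBEDDEDFILES);
// Only a flat name tree is handled here; a large tree may store its entries under /Kids instead.
PdfArray embeddedFiles = (embeddedFilesDict == null) ? null : embeddedFilesDict.getAsArray(PdfName.NAMES);

// The array holds alternating (name, file specification) pairs.
int len = (embeddedFiles == null) ? 0 : embeddedFiles.size();
for (int i = 0; i + 1 < len; i += 2) {
  PdfString name = embeddedFiles.getAsString(i); // should always be present
  PdfDictionary fileSpec = embeddedFiles.getAsDict(i + 1); // ditto
  if (name == null || fileSpec == null) continue;

  PdfDictionary streams = fileSpec.getAsDict(PdfName.EF);
  if (streams == null) continue; // external file reference, nothing embedded

  // Prefer the Unicode entry (UF); fall back to F for backwards compatibility.
  PdfStream stream = streams.contains(PdfName.UF)
      ? streams.getAsStream(PdfName.UF)
      : streams.getAsStream(PdfName.F);

  if (stream != null) {
    files.put(name.toUnicodeString(), PdfReader.getStreamBytes((PRStream) stream));
  }
}
reader.close();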
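
Since the question also asks about reading the attached XML, here is a minimal sketch of handing the collected byte arrays to the JDK's built-in XML parser. It assumes a map shaped like `files` above; the class name `AttachmentXml`, its methods, and the sample data are made up for illustration and are not part of iText. In .NET the same idea should work by loading the bytes into an XmlDocument through a MemoryStream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical helper (not an iText class): parses extracted attachment bytes as XML.
final class AttachmentXml {

    // Parses one attachment's bytes into a DOM document.
    static Document parse(byte[] data) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Attachments come from arbitrary PDFs, so refuse DOCTYPE declarations (guards against XXE).
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            return factory.newDocumentBuilder().parse(new ByteArrayInputStream(data));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("attachment is not well-formed XML", e);
        }
    }

    // Returns the root element name of every attachment whose file name ends in ".xml".
    static Map<String, String> rootElements(Map<String, byte[]> files) {
        Map<String, String> roots = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            if (e.getKey().toLowerCase(Locale.ROOT).endsWith(".xml")) {
                roots.put(e.getKey(), parse(e.getValue()).getDocumentElement().getTagName());
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the map built from a real PDF.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("invoice.xml", "<invoice><total>42</total></invoice>".getBytes(StandardCharsets.UTF_8));
        files.put("logo.png", new byte[] {1, 2, 3});
        System.out.println(rootElements(files)); // prints {invoice.xml=invoice}
    }
}
```

Filtering on the file name is only a convenience; if the attachments are not reliably named, parsing every entry and catching the exception may be the safer choice.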

Upvotes: 11

Stefano Chizzolini

Reputation: 667

This is an old question; nonetheless, I think my alternative solution (using PDF Clown) may be of some interest, as it's much cleaner (and more complete, since it iterates at both document and page level) than the code fragments previously proposed:

using org.pdfclown.bytes;
using org.pdfclown.documents;
using org.pdfclown.documents.files;
using org.pdfclown.documents.interaction.annotations;
using org.pdfclown.objects;

using System;
using System.Collections.Generic;

Dictionary<string, byte[]> ExtractAttachments(string pdfPath)
{
  Dictionary<string, byte[]> attachments = new Dictionary<string, byte[]>();

  using(org.pdfclown.files.File file = new org.pdfclown.files.File(pdfPath))
  {
    Document document = file.Document;

    // 1. Embedded files (document level).
    foreach(KeyValuePair<PdfString,FileSpecification> entry in document.Names.EmbeddedFiles)
    {EvaluateDataFile(attachments, entry.Value);}

    // 2. File attachments (page level).
    foreach(Page page in document.Pages)
    {
      foreach(Annotation annotation in page.Annotations)
      {
        if(annotation is FileAttachment)
        {EvaluateDataFile(attachments, ((FileAttachment)annotation).DataFile);}
      }
    }
  }

  // Hand the collected attachments (path -> bytes) back to the caller instead of discarding them.
  return attachments;
}

void EvaluateDataFile(Dictionary<string, byte[]> attachments, FileSpecification dataFile)
{
  if(dataFile is FullFileSpecification)
  {
    EmbeddedFile embeddedFile = ((FullFileSpecification)dataFile).EmbeddedFile;
    if(embeddedFile != null)
    {attachments[dataFile.Path] = embeddedFile.Data.ToByteArray();}
  }
}

Note that you don't have to bother with null pointer exceptions as PDF Clown provides all the necessary abstraction and automation to ensure smooth model traversal.

PDF Clown is an LGPL 3 library, implemented for both the Java and .NET platforms (I'm its lead developer). If you want to give it a try, I suggest checking out its SVN repository on sourceforge.net, as it keeps evolving.

Upvotes: 6

Robert Lane

Reputation: 31

What I got working is slightly different from anything else I have seen online.

So, just in case, I thought I would post this here to help someone else. I had to go through many different iterations to figure out - the hard way - what I needed to get it to work.

I am merging two PDFs into a third PDF, where one of the first two PDFs may have file attachments that need to be carried over into the third PDF. I am working entirely with streams in ASP.NET, C# 4.0, iTextSharp 5.1.2.0.

        // Extract Files from Submit PDF
        Dictionary<string, byte[]> files = new Dictionary<string, byte[]>();

        PdfDictionary names;
        PdfDictionary embeddedFiles;
        PdfArray fileSpecs;
        int eFLength = 0;


        names = writeReader.Catalog.GetAsDict(PdfName.NAMES); // may be null, writeReader is the PdfReader for a PDF input stream
        if (names != null)
        {
            embeddedFiles = names.GetAsDict(PdfName.EMBEDDEDFILES); //may be null
            if (embeddedFiles != null)
            {
                fileSpecs = embeddedFiles.GetAsArray(PdfName.NAMES); //may be null
                if (fileSpecs != null)
                {
                    eFLength = fileSpecs.Size;

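                    // Entries come in (name, file specification) pairs, so visit only the odd indices (1, 3, 5...).
                    for (int i = 1; i < eFLength; i += 2)
                    {
                        PdfDictionary fileSpec = fileSpecs.GetAsDict(i); // may be null
                        if (fileSpec != null)
                        {
                            PdfDictionary refs = fileSpec.GetAsDict(PdfName.EF); // may be null
                            if (refs == null) continue;
                            foreach (PdfName key in refs.Keys)
                            {
                                PRStream stream = (PRStream)PdfReader.GetPdfObject(refs.GetAsIndirectObject(key));
                                PdfString fileName = fileSpec.GetAsString(key); // F/UF name matching the F/UF stream

                                if (stream != null && fileName != null)
                                {
                                    // The indexer avoids the ArgumentException that Add throws when F and UF carry the same name.
                                    files[fileName.ToString()] = PdfReader.GetStreamBytes(stream);
                                }
                            }
                        }
                    }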
                }
            }
        }
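
The loop above reads /Names directly from the EmbeddedFiles dictionary. In the PDF specification EmbeddedFiles is a name tree, so a document with many attachments may place its entries in child nodes under /Kids instead, and a flat read would then find nothing. A minimal sketch of walking such a tree, written in Java with plain maps and lists standing in for the PDF dictionaries and arrays (`NameTreeWalk`, `collect` and `node` are hypothetical names, not iTextSharp API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a PDF name tree; maps and lists stand in for PDF dictionaries and arrays.
final class NameTreeWalk {

    // Collects (name -> value) pairs from a node, following Kids recursively.
    @SuppressWarnings("unchecked")
    static void collect(Map<String, Object> node, Map<String, Object> out) {
        if (node == null) return;
        List<Object> names = (List<Object>) node.get("Names");
        if (names != null) {
            // Leaf entries alternate: name, value, name, value, ...
            for (int i = 0; i + 1 < names.size(); i += 2) {
                out.put((String) names.get(i), names.get(i + 1));
            }
        }
        List<Object> kids = (List<Object>) node.get("Kids");
        if (kids != null) {
            for (Object kid : kids) {
                collect((Map<String, Object>) kid, out);
            }
        }
    }

    // Builds a node holding a single array entry, e.g. node("Names", "a.xml", specA).
    static Map<String, Object> node(String key, Object... items) {
        Map<String, Object> n = new LinkedHashMap<>();
        n.put(key, new ArrayList<>(Arrays.asList(items)));
        return n;
    }

    public static void main(String[] args) {
        Map<String, Object> leafA = node("Names", "a.xml", "specA", "b.xml", "specB");
        Map<String, Object> leafB = node("Names", "c.xml", "specC");
        Map<String, Object> root = node("Kids", leafA, leafB);
        Map<String, Object> found = new LinkedHashMap<>();
        collect(root, found);
        System.out.println(found.keySet()); // prints [a.xml, b.xml, c.xml]
    }
}
```

With iTextSharp the same recursion would presumably use GetAsArray(PdfName.KIDS) and GetAsDict on each child; a depth limit is worth adding for malformed files whose Kids entries form a cycle.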

Upvotes: 3

Shahzad Latif

Reputation: 1424

You may try Aspose.Pdf.Kit for .NET. The PdfExtractor class allows you to extract attachments with the help of two methods: ExtractAttachment and GetAttachment. Please see an example of attachment extraction.

Disclosure: I work as a developer evangelist at Aspose.

Upvotes: 1

Aykut Çevik

Reputation: 2088

Look for ABCpdf-Library, very easy and fast in my opinion.

Upvotes: 2
