Reputation: 25

How to download embedded PDF files in an excel worksheet?

So I have this program that I made that parses out excel data (using Gembox). However sometimes there are embedded PDF files in some of the worksheets that I would like to download/extract. I have not been able to find a way to detect and download these objects. Anyone able to point me in the right direction on how this is achieved? I know Microsoft has an Office Document extractor that reads excel files, but it only detects Office files like word and such.

Im not asking anyone to do my work for me and write out code, Im just lost here, seems like its a pretty complex process.

Upvotes: 0

Answers (1)

GemBox Dev Team

Reputation: 669

UPDATE (2020-03-28)

Newer versions of GemBox.Spreadsheet have support for ExcelWorksheet.EmbeddedObjects.

So, you can now use the following:

var workbook = ExcelFile.Load("input.xlsx");
var worksheet = workbook.Worksheets[0];

var embeddedObjects = worksheet.EmbeddedObjects;

for (int index = 0; index < embeddedObjects.Count; index++)
{
    ExcelEmbeddedObject embeddedObject = embeddedObjects[index];
    if (embeddedObject.ContentType != "application/vnd.openxmlformats-officedocument.oleObject")
        continue;

    byte[] embeddedBytes;
    using (var memoryStream = new MemoryStream())
    {
        embeddedObject.Data.CopyTo(memoryStream);
        embeddedBytes = memoryStream.ToArray();
    }

    string embeddedContent = Encoding.ASCII.GetString(embeddedBytes);
    int pdfHeaderIndex = embeddedContent.IndexOf("%PDF");
    if (pdfHeaderIndex < 0)
        continue;

    byte[] pdfBytes = new byte[embeddedBytes.Length - pdfHeaderIndex];
    Array.Copy(embeddedBytes, pdfHeaderIndex, pdfBytes, 0, pdfBytes.Length);

    File.WriteAllBytes($"embedded-pdf-{index}.pdf", pdfBytes);
}

ORIGINAL

GemBox.Spreadsheet currently does not have a support for this, but you can achive your requiroment with a System.IO.Packaging namespace in WindowsBase.dll assembly.

Try the following code sample:

using System;
using System.IO;
using System.IO.Packaging;
using System.Text;

static class PdfExtractor
{
    public static void ExtractPdf(string packagePath, string destinationDirectory)
    {
        using (var package = Package.Open(packagePath))
        {
            int i = 1;
            foreach (var part in package.GetParts())
                if (part.ContentType == "application/vnd.openxmlformats-officedocument.oleObject")
                {
                    // PDF data is embedded into OLE Object package part.

                    var pdfContent = GetPdfContent(part.GetStream());
                    if (pdfContent != null)
                        File.WriteAllBytes(Path.Combine(destinationDirectory, "EmbeddedPdf" + (i++) + ".pdf"), pdfContent);
                }
        }
    }

    private static byte[] GetPdfContent(Stream stream)
    {
        // Every PDF file/data starts with '%PDF' and ends with '%%EOF'.
        const string pdfStart = "%PDF", pdfEnd = "%%EOF";

        byte[] bytes = ConvertStreamToArray(stream);

        string text = Encoding.ASCII.GetString(bytes);

        int startIndex = text.IndexOf(pdfStart, StringComparison.Ordinal);
        if (startIndex < 0)
            return null;

        int endIndex = text.LastIndexOf(pdfEnd, StringComparison.Ordinal);
        if (endIndex < 0)
            return null;

        var pdfBytes = new byte[endIndex + pdfEnd.Length - startIndex];
        Array.Copy(bytes, startIndex, pdfBytes, 0, pdfBytes.Length);

        return pdfBytes;
    }

    private static byte[] ConvertStreamToArray(Stream stream)
    {
        var buffer = new byte[16 * 1024];
        using (var ms = new MemoryStream())
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                ms.Write(buffer, 0, read);

            return ms.ToArray();
        }
    }
}

Upvotes: 1

How to download embedded PDF files in an excel worksheet?

Answers (1)

Related Questions