Reputation: 870
I have a PDF of a little more than 10,000 pages that I'm trying to split into smaller PDFs based on a delimiter page. My current implementation works great until you throw the full 10k pages at it at once. After about the 50th created PDF (~100 pages each), it starts slowing down significantly, and memory usage jumps to about 2 GB before I get an OutOfMemoryException. I have very little experience with memory management, but I have done a lot of research. I'm resorting to asking this here simply because it's time-sensitive, so I apologize if it appears I haven't done a reasonable amount of research on my own.
My initial reading of the original PDF:
var pdfDictionary = PDFHelper.ParsePDFByPage(_workItem.FileName);

//Code behind
public static Dictionary<int, string> ParsePDFByPage(string filePath)
{
    var retVal = new Dictionary<int, string>();
    PdfReader reader = new PdfReader(filePath);
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        retVal.Add(page, PdfTextExtractor.GetTextFromPage(reader, page, new StructuredTextExtractionStrategy()));
    }
    reader.Close();
    reader.Dispose();
    return retVal;
}
After reading, I find which pages are delimiters and create an instance of HMPdf (defined below) for each page range that needs to be split out of the original:
var pdfsToCreate = pdfDictionary.Where(x => x.Value.Contains("DELIMITER"));
var pdfList = new List<HMPdf>();
foreach (var item in pdfsToCreate) //filtered from pdfDictionary (Dictionary<int, string>)
{
    //Parsing logic (most removed, just know that this part works fine)
    //After parsing, create new instance of HMPdf and add it to the list
    var pdf = new HMPdf(startPage, endPage, fileName);
    pdfList.Add(pdf);
}
After parsing, I create the PDFs:
foreach (var hmpdf in pdfList)
{
    //I've tried forcing the GC to collect after every 10 pdfs created
    string error = string.Empty;
    if (!hmpdf.TryCreate(sourcePath, destinationPath, out error))
    {
        throw new Exception("Error creating new PDF - " + error);
    }
}
HMPdf code behind:
public class HMPdf
{
    private string _path;
    private string _fileName;
    private PdfCopy _pdfCopy = null;
    private PdfReader _reader = null;
    private Document _sourceDocument = null;
    private PdfImportedPage _importedPage = null;
    private int _pageFrom;
    private int _pageTo;
    private FileStream _fileStream;

    public HMPdf(int pageFrom, int pageTo, string fileName)
    {
        _pageFrom = pageFrom;
        _pageTo = pageTo;
        _fileName = fileName;
    }

    public bool TryCreate(string sourcePath, string destinationPath, out string errorMessage)
    {
        errorMessage = string.Empty; //out parameter must be assigned on every path
        try
        {
            _reader = new PdfReader(sourcePath);
            _sourceDocument = new Document(_reader.GetPageSizeWithRotation(_pageFrom));
            _fileStream = new System.IO.FileStream(
                Path.Combine(destinationPath, _fileName.ToLower().Contains(".pdf") ? _fileName : _fileName + ".pdf"),
                System.IO.FileMode.Create);
            _pdfCopy = new PdfCopy(_sourceDocument, _fileStream);
            _sourceDocument.Open();
            for (int i = _pageFrom; i <= _pageTo; i++)
            {
                _importedPage = _pdfCopy.GetImportedPage(_reader, i);
                _pdfCopy.AddPage(_importedPage);
                _importedPage = null;
            }
            return true;
        }
        catch (Exception ex)
        {
            errorMessage = ex.Message;
            return false;
        }
        finally
        {
            if (_reader != null)
            {
                _reader.Close();
                _reader.Dispose();
                _reader = null;
            }
            if (_sourceDocument != null)
            {
                _sourceDocument.Close();
                _sourceDocument.Dispose();
                _sourceDocument = null;
            }
            if (_pdfCopy != null)
            {
                _pdfCopy.Close();
                _pdfCopy.Dispose();
                _pdfCopy = null;
            }
            if (_fileStream != null)
            {
                _fileStream.Close();
                _fileStream.Dispose();
                _fileStream = null;
            }
        }
    }
}
As you can tell, I'm closing/disposing all open file streams, readers, etc. (right?). I've tried forcing the garbage collector to run after every 10 PDFs created (roughly the pattern sketched below), but it doesn't clean anything up. I've run Telerik JustTrace, and with the little knowledge I have of memory management, a couple of things stuck out. First, between several snapshots there were 0 disposed objects, and in the last snapshot the pdfList object was taking nearly a GB of memory.
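To clarify what I mean by "forcing the garbage collector", it was essentially the standard collect/wait/collect pattern every tenth PDF (the counter name here is illustrative, not my actual code):

pdfsCreated++;
if (pdfsCreated % 10 == 0)
{
    GC.Collect();
    GC.WaitForPendingFinalizers();
    GC.Collect(); //second pass collects anything released by finalizers
}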
Am I missing something completely obvious?
Sorry for the lengthy write-up.
Upvotes: 2
Views: 968
Reputation: 47570
Maybe you are running into The Dangers of the Large Object Heap...
Try to restructure the logic so that it uses less memory.
And reduce variable scope as much as possible. That is, don't create unnecessary class-level fields; make them local variables instead.
Try something like the code below, which reduces the scope of those variables:
public bool TryCreate(string sourcePath, string destinationPath, out string errorMessage)
{
    try
    {
        using (var reader = new PdfReader(sourcePath))
        using (var sourceDocument = new Document(reader.GetPageSizeWithRotation(_pageFrom)))
        using (var fileStream = new System.IO.FileStream(
            Path.Combine(destinationPath, _fileName.ToLower().Contains(".pdf") ? _fileName : _fileName + ".pdf"),
            System.IO.FileMode.Create))
        using (var pdfCopy = new PdfCopy(sourceDocument, fileStream))
        {
            sourceDocument.Open();
            for (int i = _pageFrom; i <= _pageTo; i++)
            {
                //local call chain instead of the _importedPage field
                pdfCopy.AddPage(pdfCopy.GetImportedPage(reader, i));
            }
        }
        errorMessage = string.Empty;
        return true;
    }
    catch (Exception ex)
    {
        errorMessage = ex.Message;
        return false;
    }
}
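One more idea along the same lines: the extracted text is only used to find the delimiter pages, so there is no need to keep all 10,000 page strings alive in a dictionary; that dictionary is a prime suspect for the large object heap pressure described in the linked article. A sketch of that idea, with a hypothetical FindDelimiterPages in place of your ParsePDFByPage:

public static List<int> FindDelimiterPages(string filePath)
{
    var delimiterPages = new List<int>();
    using (var reader = new PdfReader(filePath))
    {
        for (int page = 1; page <= reader.NumberOfPages; page++)
        {
            //Each page's text is checked and dropped immediately instead of
            //accumulating 10,000 strings for the lifetime of the run.
            var text = PdfTextExtractor.GetTextFromPage(reader, page, new StructuredTextExtractionStrategy());
            if (text.Contains("DELIMITER"))
            {
                delimiterPages.Add(page);
            }
        }
    }
    return delimiterPages;
}

You can then build your page ranges from the returned page numbers instead of from the dictionary.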
Upvotes: 2