user1573820
user1573820

Reputation: 147

Slow unzipping of text files using c# dotnetzip .NET 4.0

I am making a method to extract information from zipped files. All the zip files will contain just one text file. It is the intend that method should return a string array.

I am using dotnetzip, but i am experiencing a horrable performance. I have tried to benchmark the performance of each step and seems to be performing slowly on all steps.

The c# code is:

        public string[] LoadZipFile(string FileName)
    {
        string[] lines = { };
        int start = System.Environment.TickCount;
        this.richTextBoxLOG.AppendText("Reading " + FileName + "... ");
        try
        {
            int nstart;

            nstart = System.Environment.TickCount;       
            ZipFile zip = ZipFile.Read(FileName);
            this.richTextBoxLOG.AppendText(String.Format("ZipFile ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            MemoryStream ms = new MemoryStream();
            this.richTextBoxLOG.AppendText(String.Format("Memorystream ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            zip[0].Extract(ms);
            this.richTextBoxLOG.AppendText(String.Format("Extract ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            string filecontents = string.Empty;
            using (var reader = new StreamReader(ms)) 
            { 
                reader.BaseStream.Seek(0, SeekOrigin.Begin); 
                filecontents = reader.ReadToEnd().ToString(); 
            }
            this.richTextBoxLOG.AppendText(String.Format("Read ({0}ms)\n", System.Environment.TickCount - nstart));

            nstart = System.Environment.TickCount;
            lines = filecontents.Replace("\r\n", "\n").Split("\n".ToCharArray());
            this.richTextBoxLOG.AppendText(String.Format("SplitLines ({0}ms)\n", System.Environment.TickCount - nstart));
        }
        catch (IOException ex)
        {
            this.richTextBoxLOG.AppendText(ex.Message+ "\n"); 

        }
        int slut = System.Environment.TickCount;
        this.richTextBoxLOG.AppendText(String.Format("Done ({0}ms)\n", slut - start)); 
        return (lines);

As an example I get this output:

Reading xxxx.zip... ZipFile (0ms) Memorystream (0ms) Extract (234ms) Read (78ms) SplitLines (187ms) Done (514ms)

A total of 514 ms. When the same operation is performed in python 2.6 using this code:

def ReadZip(File):
z = zipfile.ZipFile(File, "r")
name =z.namelist()[0]
return(z.read(name).split('\r\n'))

It executes in just 89 ms. Any ideas on how to improve performance is very welcome.

Upvotes: 0

Views: 1837

Answers (2)

user1573820
user1573820

Reputation: 147

Thanks for the suggestions. I enden up changing the code in a few ways:

  • Using a collection.generic to return lines
  • using streamreader.readline

Removing logging and exception handling did not change performance much. I looked at sharplibs unzip library, but it looked a little more complicated to implement and from what I could read on other post there was maybe a little gain in unzipping. It is now running at around 300ms.

        public List<string> LoadZipFile2(string FileName)
    {
        List<string> lines = new List<string>();
        int start = System.Environment.TickCount;
        string debugtext;
        debugtext = "Reading " + FileName + "... ";
        this.richTextBoxLOG.AppendText(debugtext);

        try
        {
            //int nstart = System.Environment.TickCount;
            ZipFile zip = ZipFile.Read(FileName);
           // this.richTextBoxLOG.AppendText(String.Format("ZipFile ({0}ms)\n", System.Environment.TickCount - nstart));

            //nstart = System.Environment.TickCount;
            MemoryStream ms = new MemoryStream();
            //this.richTextBoxLOG.AppendText(String.Format("Memorystream ({0}ms)\n", System.Environment.TickCount - nstart));

            //nstart = System.Environment.TickCount;
            zip[0].Extract(ms);
            zip.Dispose();
            //this.richTextBoxLOG.AppendText(String.Format("Extract ({0}ms)\n", System.Environment.TickCount - nstart));

            //nstart = System.Environment.TickCount;
            using (var reader = new StreamReader(ms))
            {
                reader.BaseStream.Seek(0, SeekOrigin.Begin);
                while (reader.Peek() >= 0)
                {
                    lines.Add(reader.ReadLine());
                }
            }
            ;
            //this.richTextBoxLOG.AppendText(String.Format("Read ({0}ms)\n", System.Environment.TickCount - nstart));
        }
        catch (IOException ex)
        {
            this.richTextBoxLOG.AppendText(ex.Message + "\n");
        }
        int slut = System.Environment.TickCount;
        this.richTextBoxLOG.AppendText(String.Format("Done ({0}ms)\n", slut - start));
        return (lines);

Upvotes: 1

Dan Puzey
Dan Puzey

Reputation: 34198

Your code is not like-for-like, so the comparison is unfair. Some important points:

  • Have you tried removing your logging code? The AppendText calls will be responsible for some of the extra time.
  • You do a file-wide replace before you call split, which will massively slow that part of the process. Just split on \r\n instead.
  • You convert each line to a char array instead of just returning the string. This will also slow things down.
  • You might want to compare different Zip libraries to see if there's a faster way of extracting.
  • It might be faster to repeatedly call StreamReader.ReadLine than to read the whole stream and then split it manually.

In short, you should profile some of the alternative methods, and you should time your code without using the RichTextBox for intermediate logging if you want a true like-for-like comparison.

Upvotes: 1

Related Questions