ManInMoon

Reputation: 7005

Why is a binary file very large compared to text?

I have been keeping a large set of data as TEXT records in a TEXT file:

yyyyMMddTHHmmssfff double1 double2

However, when I read it back I need to parse each DateTime, which is quite slow for millions of records.

So now I am trying a binary file, which I create by serializing my class.

That way I do not need to parse the DateTime.

    class MyRecord
    {
        DateTime DT;
        double Price1;
        double Price2;
    }

    public byte[] SerializeToByteArray()
    {
        var bf = new BinaryFormatter();
        using (var ms = new MemoryStream())
        {
            bf.Serialize(ms, this);
            return ms.ToArray();
        }
    }

    MyRecord mr;

    var outBin = new BinaryWriter(File.Create(binFileName, 2048, FileOptions.None));

    foreach (var rec in AllRecords) // pseudocode
    {
        mr = new MyRecord(rec);     // pseudocode
        outBin.Write(mr.SerializeToByteArray());
    }

The resulting binary is on average 3 times the size of the TEXT file.

Is that to be expected?

EDIT 1

I am exploring using protobuf-net to help me:

I want to do this with a using statement, keeping my existing structure.

    private void DisplayBtn_Click(object sender, EventArgs e)
    {
        string fileName = dbDirectory + @"\nAD20120101.dat";

        FileStream fs = File.OpenRead(fileName);

        MyRecord tr;
        while (fs.CanRead)
        {
            tr = Serializer.Deserialize<MyRecord>(fs);

            Console.WriteLine("> " + tr.ToString());
        }
    }

BUT after the first record, tr is full of zeroes.

Upvotes: 0

Views: 432

Answers (3)

BRAHIM Kamel

Reputation: 13794

As requested by the OP:

The output is not a plain binary file; it is a binary serialization of your instances, plus the overhead BinaryFormatter adds so it can deserialize them later. That is why the file ends up about three times larger than you expected. If you need a more compact serialization solution, take a look at protobuf-net: https://code.google.com/p/protobuf-net/

Here is how you can decorate your class to achieve this:

    [ProtoContract]
    public class MyRecord
    {
        [ProtoMember(1)]
        public DateTime DT;

        [ProtoMember(2)]
        public double Price1;

        [ProtoMember(3)]
        public double Price2;
    }
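If you write several records to one stream, each record also needs a length prefix so the reader knows where one message ends and the next begins; a plain Serializer.Deserialize call reads to the end of the stream, and further calls on the exhausted stream give you empty objects, which may explain the zeroes in the OP's EDIT. A rough sketch of that pattern (not the OP's code; records and fileName are placeholder names), using protobuf-net's SerializeWithLengthPrefix/DeserializeWithLengthPrefix:

    // Requires: using System; using System.IO; using ProtoBuf;
    // Sketch only: "records" and "fileName" are placeholder names.

    // Write: length-prefix each record so the messages can be separated later.
    using (var fs = File.Create(fileName))
    {
        foreach (MyRecord rec in records)
        {
            Serializer.SerializeWithLengthPrefix(fs, rec, PrefixStyle.Base128);
        }
    }

    // Read: DeserializeWithLengthPrefix returns null at the end of the stream.
    using (var fs = File.OpenRead(fileName))
    {
        MyRecord tr;
        while ((tr = Serializer.DeserializeWithLengthPrefix<MyRecord>(fs, PrefixStyle.Base128)) != null)
        {
            Console.WriteLine("> " + tr);
        }
    }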

Upvotes: 0

Mathias

Reputation: 1500

You are not storing a simple binary version of your DateTime values, but a serialized object representing them. That is much larger than simply storing the date as text.

If you create a class

class MyRecords
{
    DateTime[] DT;
    double[] Price1;
    double[] Price2;
}

and serialize that, it should be much smaller.

Also, I guess DateTime still needs a lot of space, so you can convert your DateTime to an integer Unix timestamp and store that.
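For example, a rough sketch of that idea (not code from the answer; WriteRecord/ReadRecord are made-up helper names, and milliseconds since the epoch are used instead of seconds so the fff part of the OP's format is not lost):

    // Requires: using System; using System.IO;
    // Sketch only: assumes MyRecord's fields are accessible.
    static readonly DateTime UnixEpoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);

    static void WriteRecord(BinaryWriter w, MyRecord rec)
    {
        long unixMs = (long)(rec.DT.ToUniversalTime() - UnixEpoch).TotalMilliseconds;
        w.Write(unixMs);      // 8 bytes
        w.Write(rec.Price1);  // 8 bytes
        w.Write(rec.Price2);  // 8 bytes -> 24 bytes per record, no type metadata
    }

    static MyRecord ReadRecord(BinaryReader r)
    {
        var rec = new MyRecord();
        rec.DT = UnixEpoch.AddMilliseconds(r.ReadInt64());
        rec.Price1 = r.ReadDouble();
        rec.Price2 = r.ReadDouble();
        return rec;
    }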

Upvotes: 0

sehe

Reputation: 394044

Your archive likely has considerable overhead serializing type information with each record.

Instead, make the whole collection serializable (if it isn't already) and serialize that in one go.
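A minimal sketch of that idea (assuming BinaryFormatter is kept and the record fields are public; not the answerer's code):

    // Requires: using System; using System.Collections.Generic; using System.IO;
    //           using System.Runtime.Serialization.Formatters.Binary;
    [Serializable]
    class MyRecord
    {
        public DateTime DT;
        public double Price1;
        public double Price2;
    }

    // Serialize the whole list in one call: the type metadata is written once
    // instead of once per record.
    static void SaveAll(List<MyRecord> records, string binFileName)
    {
        using (var fs = File.Create(binFileName))
        {
            new BinaryFormatter().Serialize(fs, records);
        }
    }

    static List<MyRecord> LoadAll(string binFileName)
    {
        using (var fs = File.OpenRead(binFileName))
        {
            return (List<MyRecord>)new BinaryFormatter().Deserialize(fs);
        }
    }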

Upvotes: 1
