Sam

Reputation: 432

Fast Read/Write of Complex .Net Object Graph

I have my own data structure written in C# (the structure is quite complex). I need to serialize and deserialize the structure. The size of the serialized file on disk can be fairly large at times (close to 1 GB), but it can also be small (depending on the number of records stored). I have the following requirements:

  1. The serialization and deserialization should be very fast
  2. I should be able to partially deserialize a large file (i.e. access only some relevant records), because if I deserialize the whole file from disk the memory usage will be too high.
  3. It should be safe for concurrent access, as multiple processes can read and write records in the file

I know it sounds like I need a database, but I cannot use one for multiple reasons. I tried fulfilling requirement 1 by implementing ISerializable, which made it much faster than using the .NET built-in binary/XML serializers, but not fast enough. Requirement 2 is completely stumping me.
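
For illustration, the kind of ISerializable implementation I mean looks roughly like this (a simplified sketch; Record and its fields stand in for my actual, much more complex structure):

using System;
using System.Runtime.Serialization;

[Serializable]
class Record : ISerializable {
    public int Id;
    public double Value;

    public Record() { }

    // Write each field explicitly rather than letting the
    // formatter reflect over the type.
    public void GetObjectData(SerializationInfo info, StreamingContext context) {
        info.AddValue("id", Id);
        info.AddValue("v", Value);
    }

    // Matching deserialization constructor used by the formatter.
    protected Record(SerializationInfo info, StreamingContext context) {
        Id = info.GetInt32("id");
        Value = info.GetDouble("v");
    }
}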

Anybody out there got any ideas on how to go about this? I imagine anybody who has had to save their own large file format has had to deal with similar issues.

Regards, Sam

Upvotes: 4

Views: 1891

Answers (4)

Marc Gravell

Reputation: 1064054

Is it a data tree, or a full graph - i.e. are there any circular references? If not, protobuf-net is a high-performance binary tree serializer. It supports streaming of enumerable items (so you can skip records, etc - rather than buffer everything), but to efficiently seek to a random element I expect you'd need some kind of index.

Read/write is VERY hard for a single file; in particular, writing may require moving a lot more of the data on disk than you expect... reading is also tricky and might require synchronization. It would be easier to use separate files...


Here's an example of skipping early items; I could probably add a helper method, but the TryDeserializeWithLengthPrefix method works... the key point to watch is that between serialization and deserialization we only create one extra object (the one we actually deserialize).

using System;
using System.IO;
using System.Threading;
using ProtoBuf;

[ProtoContract]
class Foo {
    static int count;
    public static int ObjectCount { get { return count; } }
    public Foo() { // track how many objects have been created...
        Interlocked.Increment(ref count);
    }
    [ProtoMember(1)]
    public int Id { get; set; }
    [ProtoMember(2)]
    public double Bar { get; set; }
}
static class Program {
    static void Main() {
        MemoryStream ms = new MemoryStream();
        Random rand = new Random();
        // write 5000 length-prefixed records back to back
        for (int i = 1; i <= 5000; i++) {
            Foo foo = new Foo { Bar = rand.NextDouble(), Id = i };
            Serializer.SerializeWithLengthPrefix(ms, foo, PrefixStyle.Base128, 1);
        }
        ms.Position = 0;
        // skip the first 999 records without materializing them;
        // the resolver returns null to skip, typeof(Foo) to deserialize
        int index = 0;
        object obj;
        Console.WriteLine(Foo.ObjectCount); // 5000
        Serializer.NonGeneric.TryDeserializeWithLengthPrefix(
            ms, PrefixStyle.Base128,
            tag => ++index == 1000 ? typeof(Foo) : null, out obj);
        Console.WriteLine(Foo.ObjectCount); // 5001 - only one extra object
        Console.WriteLine(((Foo)obj).Id);   // 1000
    }
}
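
Building on that: if you need true random access rather than skipping, one option (my own sketch, not a built-in protobuf-net feature) is to record each item's stream offset while writing and Seek straight to it later:

using System.Collections.Generic; // in addition to the usings above

static class IndexedAccess {
    // Assumes the Foo type from the example above.
    static void Demo() {
        var offsets = new List<long>();
        var ms = new MemoryStream();
        var rand = new Random();
        for (int i = 1; i <= 5000; i++) {
            offsets.Add(ms.Position); // remember where record i starts
            Serializer.SerializeWithLengthPrefix(ms,
                new Foo { Bar = rand.NextDouble(), Id = i },
                PrefixStyle.Base128, 1);
        }
        // jump directly to record 1000 (zero-based index 999)
        ms.Position = offsets[999];
        Foo foo = Serializer.DeserializeWithLengthPrefix<Foo>(
            ms, PrefixStyle.Base128, 1);
        Console.WriteLine(foo.Id); // 1000
    }
}

For a file on disk, the offsets would themselves need to be persisted, e.g. in a small side-car index file.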

Upvotes: 2

Cecil Has a Name

Reputation: 4992

For partial (or split) deserialization (which I have been looking into somewhat myself, for things such as dynamic and static parts in a game level), I think you will have to write your own serialization engine.

Upvotes: 0

rAm

Reputation: 1106

I haven't worked on a scenario quite like yours. However, I have discussed a similar issue in the past, and here is the outcome of that discussion (though I confess I have never seen the implementation). Also, I am afraid there may not be any simple, straightforward solution.

Assumptions:

i. The data to be written is sorted.

Solution:

i. Fragment your data store into multiple files. Allot a range of the sorted values to each file, e.g. records 1-10000 in file 1, records 10001-20000 in file 2, and so on (see the sketch below).

ii. When you write/read data, you know the range beforehand, so you only need to touch the relevant file, which meets requirement 2.

iii. It also helps with requirement 3, as long as the chance of two or more processes requesting exactly the same data is low.

To be able to provide a more accurate solution, we need more information on what you are trying to achieve.
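
A minimal sketch of the mapping in point i (the file naming, 10000-records-per-file split, and sharing mode are assumptions for illustration):

using System;
using System.IO;

static class Store {
    const int RecordsPerFile = 10000;

    // Map a sorted record id to the fragment file that holds its range,
    // e.g. id 1-10000 -> "data_0001.bin", id 10001-20000 -> "data_0002.bin".
    static string FileFor(int recordId) {
        int fileIndex = (recordId - 1) / RecordsPerFile + 1;
        return string.Format("data_{0:D4}.bin", fileIndex);
    }

    // Open only the fragment that contains the requested record; the rest
    // of the store is never touched, so memory use stays bounded.
    public static FileStream OpenFor(int recordId, FileAccess access) {
        return new FileStream(FileFor(recordId), FileMode.OpenOrCreate,
            access, FileShare.Read);
    }
}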

Upvotes: 2

fretje

Reputation: 8382

I think we'll need more information about what the file actually looks like...

Can't you just read pieces of sizeof(yourstruct) from the file and process them separately, instead of reading all the records into memory?
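
For example, if the records are fixed-size, something like this could pull out just the n-th record (a sketch; RecordSize and the record layout are assumptions):

using System;
using System.IO;

class RecordReader {
    const int RecordSize = 32; // assumed fixed on-disk size of one record

    // Read only the n-th record's bytes instead of loading the whole file.
    public static byte[] ReadRecord(string path, long n) {
        using (var fs = new FileStream(path, FileMode.Open,
                                       FileAccess.Read, FileShare.Read)) {
            fs.Seek(n * RecordSize, SeekOrigin.Begin);
            var buffer = new byte[RecordSize];
            int read = fs.Read(buffer, 0, RecordSize);
            if (read != RecordSize) throw new EndOfStreamException();
            return buffer;
        }
    }
}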

Upvotes: 0
