Bin03
Bin03

Reputation: 59

Multithreaded deserializing bottlenecked by GC

I have a program that loads a data from disk to byte array, so IO operation just happens once, no IO bottleneck. Then i deserialize this data of byte array using all of my CPU cores/threads (12). This operation deserialize the byte array into objects so there is a lot of heap allocation, creating a new object for each deserialized data. All of my thread workers operates independently, no locking, and all of them reads from same single byte array.

I was expecting my program to use the full power of my CPU while doing it, but it didn't. Even though my program splits the workload across different Tasks up to my CPU thread count. On average, it utilizes 30-60% CPU usage at most. So i try to research and found this question, and one of the answer said "Concurrency Visualizer" will help me profile this. It also mention to enable GC Server but it didn't worked for me, maybe i did it wrong and anybody could point me out that.

Profiling Result

I profile my program with "Concurrency Visualizer", then i discovered a lot of blocking times in all of my worker threads. I also discovered while all of my worker threads are blocked, there is always a single random worker threads that executing alone. I inspect the call stack of this lone-executing-worker-thread, the call stack frame name mentions gc_heap which i presume it is doing garbage collecting while all of my other worker threads paused during GC execution. But the thing is, my program does not allocate object unneccessary, only does so when serialized object is fully deserialized. So there must be allocation somewhere and i could not find it in my code anyway. Maybe the ghost allocation happens outside my code (e.g List resizing).

You can see below the screenshot of my program execution timeline (the top one) and the detail of the lone-executing-worker-thread call stack:

Concurrency Visualizer result

I also try to run my program again in single threaded, profile it, then this time i noticed the program doesn't get blocked by GC at all. I then interested what happen if i keep repeatedly rerun the program that in each repetitions, i added more thread. I noticed in the profiler that, the more thread i use, the more GC blocking it is.

So my question is how do i do multithreaded data deserialization without GC gets in way (or remove ghost allocation) that cause my program to underutilize my CPU?

Summary of my question description:

If you need to take a look at my deserializer code, you can check it here

Edit

As requested by Theodor Zoulias, below is an example of the implementation of what the deserialized data structure will be and the deserializer implementation:

Data Structure for deserialized data

public interface IData 
{
    public string Name { get; set; }
}

public class DataInt : IData
{
    public string Name { get; set; }
    public int Value { get; set; }
    public DataInt(string name, int value)
    {
        Name = name;
        Value = value;
    }
}

public class DataIntArray : IData
{
    public string Name { get; set; }
    public int[] Values { get; set; }
    public DataIntArray(string name, int[] values)
    {
        Name = name;
        Values = values;
    }
}

public class DataString : IData
{
    public string Name { get; set; }
    public string Value { get; set; }
    public DataString(string name, string value)
    {
        Name = name;
        Value = value;
    }
}

public class DataList<T> : List<T>, IData where T : class, IData
{
    public string Name { get; set;}
    public DataList(string name, int capacity) : base(capacity)
    {
        Name = name;
    }
}

public class DataDict : Dictionary<string, IData>, IData
{
    public string Name { get; set;}
    public DataDict(string name, int capacity) : base(capacity)
    {
        Name = name;
    }
}

Deserializer implementation example, for sake of simplicity i am not checking if end of stream is reached or something

/*
Concept: Serialized object has 3 properties: Type ID, Name, and value
- Type ID is a single byte, valid ID range from 1 to 5:
  - 1: Value is integer  
  - 2: Value is array of integer
  - 3: Value is a string
  - 4: Value is a length-prefixed list of another serialized objects, may be nested
  - 5: Value is a dictionary of another serialized objects, may be nested.
       Length is unknown (possible dictionary resizing), but if it encounter byte 0, 
       it signals where the elements end (see below description)
Type ID Zero is magic ID that signal ReadDictData that tells "Hey, you have got all
the elements inside you, so stop assuming the next deserialized data is inside you."

- Name is a length prefixed string
- Value is the value of the data. The datatype depends on the ID
*/

public static class Deserializer
{
    public static IData Parse(Stream stream)
    {
        int type = stream.ReadByte();
        return ReadDataSwitch(new BinaryReader(stream), type, insideList: false);
    }

    private static IData ReadDataSwitch(BinaryReader reader, int type, bool insideList)
    {
        // data does not have name if inside list
        string name = insideList ? "" : reader.ReadString();
        return type switch
        {
            0 => throw new InvalidDataException("Termination value outside DataDict"),
            1 => ReadIntData(reader, name),
            2 => ReadIntArrayData(reader, name),
            3 => ReadStringData(reader, name),
            4 => ReadListDataSwitch(reader, name),
            5 => ReadDictData(reader, name),
            _ => throw new InvalidDataException("Unknown type ID")
        };
    }

    private static DataInt ReadIntData(BinaryReader reader, string name)
    {
        int value = reader.ReadInt32();
        return new DataInt(name, value);
    }

    private static DataString ReadStringData(BinaryReader reader, string name)
    {
        int stringLength = reader.ReadInt32();
        // allocate on heap instead if exceed 1024 bytes
        Span<byte> buff = stackalloc byte[stringLength];
        string value = System.Text.Encoding.UTF8.GetString(buff);
        return new DataString(name, value);
    }

    private static DataIntArray ReadIntArrayData(BinaryReader reader, string name)
    {
        // length prefixed array, we can determine array size
        int length = reader.ReadInt32();
        int[] values = new int[length];
        for (int i = 0; i < length; i++)
            values[i] = reader.ReadInt32();
        return new DataIntArray(name, values);
    }

    private static IData ReadListDataSwitch(BinaryReader reader, string name)
    {
        int type = reader.ReadByte();
        return type switch
        {
            0 => throw new InvalidDataException("Termination value outside DataDict"),
            1 => ReadListData<DataInt>(reader, type, name),
            2 => ReadListData<DataIntArray>(reader, type, name),
            3 => ReadListData<DataString>(reader, type, name),
            4 => ReadListData<IData>(reader, type, name), // nested list, polymorph upcast it
            5 => ReadListData<DataDict>(reader, type, name),
            _ => throw new InvalidDataException("Unknown type ID")
        };

    }

    private static DataList<T> ReadListData<T>(BinaryReader reader, int type, string name) where T : class, IData
    {
        // Prefixed with length, we can determine initial capacity, no list resizing
        int listLength = reader.ReadInt32();
        DataList<T> dataList = new(name, capacity: listLength);
        for (int i = 0; i < listLength; i++)
        {
            // theoretically cast will always succeed as we know the type ID
            T data = (ReadDataSwitch(reader, type, insideList: true) as T)!;
            dataList.Add(data);
        }
        return dataList;
    }

    private static DataDict ReadDictData(BinaryReader reader, string name)
    {
        // Does not prefixed with length, therefore we cannot determine initial size.
        // Resizing may be possible if the item count is above 50 (Beware GC reallocation)
        DataDict dataDict = new(name, 50);
        int type = reader.ReadByte();
        while (type != 0)
        {
            // type (or ID) 0 is special ID to mark the end of dictionary
            // deserializer
            IData data = ReadDataSwitch(reader, type, insideList: false);
            dataDict.Add(data.Name, data);

            // read next serialized data
            type = reader.ReadByte();
        }
        return dataDict;
    }
}

#Edit for context clarity What my code does is, lets say you have 100 files in your folders. Then my Program will take all these 80 files into 100 byte arrays, and then process (deserialize) multiple byte arrays up to your machine CPU threads. All these thread workers works on separate byte arrays, so it is executing on complete isolation

Upvotes: 0

Views: 118

Answers (1)

Bin03
Bin03

Reputation: 59

I stopped believing stubbornly that the culprit to this bottleneck is allocation outside my code. I will come back to it again once it become an issue again. I did what Evk said in the comment according to the linked question, which is to change the GC mode to Server. My program now use 100% utilization of my CPU, bottleneck removed, problem solved.

However, i noticed the Server GC as double edged sword: when GC kicks in, it doesn't release memory back to the operating system immediately, it holds on to it. On workstation GC, my Program only use 1.2 gigabytes ram, meanwhile on server GC, it can use up to 4 gigabytes ram. It may be disadvantage to other processes, but since my program still holds on to it, it's an advantage in case my program needs more memory, it will just reuse that portion of memory without having to make memory request to the OS too much

Upvotes: 1

Related Questions