A scalable collection for large indeterminate dataset

Question

I have a process that works over a large dataset, processing records within a Parallel.ForEach and storing the results in a ConcurrentQueue<List<string>>. Each field in a record produces a string, which is added to a List<string>. At the end of the record that List is enqueued, and further processing is done on the ConcurrentQueue holding all the processed records.

After a couple of hours of processing the set, I have noticed that my CPU usage has gone from rising and falling in a wave to staying pretty high, and the time to process a group of records has started to grow.

My assumption here is that the List is filled to capacity and then copied into a new, larger List. As the size grows, so does the CPU required to keep up with this cycle of reaching capacity and reinitializing. The dataset I'm working with is of indeterminate size, in that each record has a variable number of child records. The number of parent records is usually in the area of 500k.

So my first thought is to initialize the List to the Count of the parent records. The List would still have to grow due to the child records, but it would at least have to grow fewer times. But is there some other collection, an alternative to List, that scales better? Or is there a different approach that would work better than my first instinct?

Johan Donne · Accepted Answer

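A ConcurrentQueue is implemented as a linked list of array segments: when it fills up it links in a new segment rather than copying its existing elements into a larger buffer (unlike the regular Queue, which resizes its backing array). So your problem will be elsewhere.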

You might want to look into the amount of memory used and rate of garbage collection caused by cleaning up processed Lists.
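
One way to check this is to sample the GC counters before and after each batch of records. The following is only a minimal sketch: GcSnapshot and GcProbeDemo are hypothetical names, and the loop in Main is a stand-in for your real processing, not your actual code. It relies only on GC.CollectionCount and GC.GetTotalMemory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of .NET): captures GC counters at one point in time.
public sealed class GcSnapshot
{
    public int Gen0 { get; }
    public int Gen1 { get; }
    public int Gen2 { get; }
    public long HeapBytes { get; }

    private GcSnapshot(int gen0, int gen1, int gen2, long heapBytes)
    {
        Gen0 = gen0;
        Gen1 = gen1;
        Gen2 = gen2;
        HeapBytes = heapBytes;
    }

    public static GcSnapshot Take()
    {
        return new GcSnapshot(
            GC.CollectionCount(0),
            GC.CollectionCount(1),
            GC.CollectionCount(2),
            GC.GetTotalMemory(false)); // false: do not force a collection first
    }

    // Formats the change between two snapshots.
    public static string Describe(GcSnapshot before, GcSnapshot after)
    {
        return string.Format(
            "gen0 +{0}, gen1 +{1}, gen2 +{2}, heap {3:N0} -> {4:N0} bytes",
            after.Gen0 - before.Gen0,
            after.Gen1 - before.Gen1,
            after.Gen2 - before.Gen2,
            before.HeapBytes,
            after.HeapBytes);
    }
}

public static class GcProbeDemo
{
    public static void Main()
    {
        GcSnapshot before = GcSnapshot.Take();

        // Stand-in for one batch of real work: many short Lists of strings.
        var batch = new List<List<string>>();
        for (int r = 0; r < 100000; r++)
        {
            var fields = new List<string>();
            for (int f = 0; f < 10; f++)
            {
                fields.Add("field " + f);
            }
            batch.Add(fields);
        }

        GcSnapshot after = GcSnapshot.Take();
        Console.WriteLine(GcSnapshot.Describe(before, after));
        GC.KeepAlive(batch); // keep the batch reachable until after the second snapshot
    }
}
```

If the gen2 count and the heap size keep climbing from batch to batch, memory pressure is a more likely culprit than collection resizing.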

Other tips:

If there is a lot of string manipulation when constructing the string from a field, use StringBuilder (if you are not already doing that).
If there are lots of fields in a record and you have a way of knowing beforehand how many, use an array per record instead of a List, or set the List capacity to a value that will accommodate all strings for the record.
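
Both tips can be sketched roughly as follows. This is an illustration under assumptions, not your actual code: RecordBuilder, FormatField, ToArray, and ToList are hypothetical names, and the "name=part|part" layout is invented just to have some string building to show.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper illustrating the two tips; names and string format are invented.
public static class RecordBuilder
{
    // Tip 1: build a field's string with one StringBuilder
    // instead of repeated string concatenation.
    public static string FormatField(string name, IEnumerable<string> parts)
    {
        var sb = new StringBuilder();
        sb.Append(name).Append('=');
        bool first = true;
        foreach (string part in parts)
        {
            if (!first)
            {
                sb.Append('|');
            }
            sb.Append(part);
            first = false;
        }
        return sb.ToString();
    }

    // Tip 2a: field count known up front, so a fixed-size array never resizes.
    public static string[] ToArray(IReadOnlyList<string> rawFields)
    {
        var result = new string[rawFields.Count];
        for (int i = 0; i < rawFields.Count; i++)
        {
            result[i] = FormatField("f" + i, new[] { rawFields[i] });
        }
        return result;
    }

    // Tip 2b: same idea with a List whose capacity is set in the constructor,
    // so Add never has to grow the backing array.
    public static List<string> ToList(IReadOnlyList<string> rawFields)
    {
        var result = new List<string>(rawFields.Count);
        for (int i = 0; i < rawFields.Count; i++)
        {
            result.Add(FormatField("f" + i, new[] { rawFields[i] }));
        }
        return result;
    }
}

public static class RecordBuilderDemo
{
    public static void Main()
    {
        Console.WriteLine(RecordBuilder.FormatField("name", new[] { "a", "b" })); // prints "name=a|b"

        var raw = new[] { "x", "y", "z" };
        string[] asArray = RecordBuilder.ToArray(raw);
        List<string> asList = RecordBuilder.ToList(raw);
        Console.WriteLine(asArray.Length + " fields, list capacity " + asList.Capacity);
    }
}
```

Either result can be enqueued per record; the array variant would mean changing the queue's element type to string[].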
