Sheyko Dmitriy

Reputation: 413

Fill huge file with random data quickly

I'm trying to fill a file of enormous size (>1 GB) with random data.

I've written a simple "thread-safe random" that generates strings (the approach was suggested at https://devblogs.microsoft.com/pfxteam/getting-random-numbers-in-a-thread-safe-way/), and reworking it to produce random strings is trivial.
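For reference, the pattern from that blog post can be sketched with a `ThreadLocal<Random>`. This is only an illustration of the technique; the class name `ThreadSafeRandom` matches the code below, but the `NextString` helper and the character set are my own assumptions:

```csharp
using System;
using System.Linq;
using System.Threading;

// A minimal sketch of the ThreadLocal<Random> pattern from the linked blog post.
// Each thread gets its own Random instance, so no locking is needed.
public static class ThreadSafeRandom
{
    private static int _seed = Environment.TickCount;

    private static readonly ThreadLocal<Random> _local =
        new ThreadLocal<Random>(() => new Random(Interlocked.Increment(ref _seed)));

    public static int Next() => _local.Value.Next();

    // Hypothetical helper: the "trivial" extension to random strings
    // that the question mentions.
    public static string NextString(int length)
    {
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        Random r = _local.Value;
        return new string(Enumerable.Range(0, length)
            .Select(_ => chars[r.Next(chars.Length)]).ToArray());
    }
}
```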

I'm trying to write this to file using this code:

String rp;

Parallel.For(1, numlines -1, i => 
{
    rp = ThreadSafeRandom.Next();
    outputFile.WriteLineAsync(rp.ToString()).Wait();
});

When the number of lines is small, the file is generated perfectly.

When I enter a bigger number of lines (say 30000), the following happens:

I tried changing the loop to Parallel.For(1, numlines - 1, async i => with await outputFile.WriteLineAsync(rp.ToString());

and also tried doing

lock (outputFile) {
    outputFile.WriteLineAsync(rp.ToString());
}

I can always use a single-threaded approach with a simple for loop and a plain WriteLine(), but as I've said I want to generate big files. I assume that even a simple loop generating >10000 records can take some time (a big file will have 1e6 or even 1e9 records, which is more than 20 GB), and I can't think of an optimal approach.
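For context, the single-threaded baseline mentioned here can be sketched as follows. The file name, record count, and buffer size are arbitrary choices for illustration; a large FileStream buffer is one common way to cut down on syscall overhead:

```csharp
using System;
using System.IO;

// Single-threaded baseline: one Random, one StreamWriter over a FileStream
// with a large (1 MB) buffer. Values here are illustrative assumptions.
Random random = new Random();
using (StreamWriter writer = new StreamWriter(
    new FileStream("Huge.txt", FileMode.Create, FileAccess.Write,
        FileShare.None, bufferSize: 1 << 20)))
{
    for (int i = 0; i < 1_000_000; i++)
    {
        writer.WriteLine(random.Next());
    }
}
```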

Can someone suggest how to optimize this?

Upvotes: 0

Views: 744

Answers (1)

Theodor Zoulias

Reputation: 43545

Your limiting factor is probably the speed of your hard disk. Nevertheless, you may gain some performance by splitting the work in two: one thread (the producer) produces the random lines, and another thread (the consumer) writes the produced lines to the file. The code below writes 1,000,000 random lines (about 10 MB) to a file on my SSD in less than a second.

BlockingCollection<string> buffer = new(boundedCapacity: 10);
Task producer = Task.Factory.StartNew(() =>
{
    Random random = new();
    StringBuilder sb = new();
    for (int i = 0; i < 10000; i++) // 10,000 chunks
    {
        sb.Clear();
        for (int j = 0; j < 100; j++) // 100 lines each chunk
        {
            sb.AppendLine(random.Next().ToString());
        }
        buffer.Add(sb.ToString());
    }
    buffer.CompleteAdding();
}, default, TaskCreationOptions.LongRunning, TaskScheduler.Default);
Task consumer = Task.Factory.StartNew(() =>
{
    using StreamWriter outputFile = new(@".\..\..\Huge.txt");
    foreach (string chunk in buffer.GetConsumingEnumerable())
    {
        outputFile.Write(chunk);
    }
}, default, TaskCreationOptions.LongRunning, TaskScheduler.Default);
Task.WaitAll(producer, consumer);

This way you don't even need thread safety in the production of the random lines, because the production happens on a single thread.


Update: In case the writing to disk is not the bottleneck, and the producer is slower than the consumer, more producers can be added. Below is a version with three producers and one consumer.

BlockingCollection<string> buffer = new(boundedCapacity: 10);
Task[] producers = Enumerable.Range(0, 3)
    .Select(n => Task.Factory.StartNew(() =>
    {
        Random random = new(n); // Non-random seed, same data on every run
        StringBuilder sb = new();
        for (int i = 0; i < 10000; i++)
        {
            sb.Clear();
            for (int j = 0; j < 100; j++)
            {
                sb.AppendLine(random.Next().ToString());
            }
            buffer.Add(sb.ToString());
        }
    }, default, TaskCreationOptions.LongRunning, TaskScheduler.Default))
    .ToArray();
Task allProducers = Task.WhenAll(producers).ContinueWith(_ =>
{
    buffer.CompleteAdding();
}, TaskScheduler.Default);
// The consumer is the same as previously (omitted)
Task.WaitAll(allProducers, consumer);

Upvotes: 3
