greydamian

Reputation: 153

Optimal Buffer Sizing

I guess this is a performance computing question. I'm writing a program in C which produces a large amount of output, much more than can typically be stored in RAM in its entirety. I intend to simply write the output to stdout; it may just be going to the screen, or it may be redirected into a file. My problem is: how do I choose an optimal buffer size for the data that will be stored in RAM?

The output data itself isn't particularly important, so let's just say that it's producing a massive list of random integers.

I intend to have 2 threads: one which produces the data and writes it to a buffer, and the other which writes that buffer to stdout. This way, I can begin producing the next buffer of output whilst the previous buffer is still being written to stdout.
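
For context, the handoff I have in mind looks roughly like this (a minimal pthreads sketch; the buffer size, chunk count, and names are only illustrative):

    /* Minimal sketch of the double-buffer handoff described above.
       Sizes and names are illustrative; the data is just random integers. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BUF_SIZE (1 << 16)   /* the value this question is about */
    #define N_CHUNKS 64          /* stop after this many buffers */

    static char bufs[2][BUF_SIZE];
    static size_t filled[2];     /* bytes ready in each buffer */
    static int ready[2];         /* 1 = full, waiting to be written */
    static int done;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg)
    {
        (void)arg;
        for (int chunk = 0; chunk < N_CHUNKS; chunk++) {
            int i = chunk % 2;
            pthread_mutex_lock(&lock);
            while (ready[i])               /* wait until the writer drained it */
                pthread_cond_wait(&cond, &lock);
            pthread_mutex_unlock(&lock);

            size_t n = 0;                  /* fill the buffer outside the lock */
            while (n + 12 < BUF_SIZE)      /* room for "%d\n" plus NUL */
                n += (size_t)sprintf(bufs[i] + n, "%d\n", rand());

            pthread_mutex_lock(&lock);
            filled[i] = n;
            ready[i] = 1;
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        for (int i = 0; ; i = (i + 1) % 2) {
            pthread_mutex_lock(&lock);
            while (!ready[i] && !done)
                pthread_cond_wait(&cond, &lock);
            if (!ready[i] && done) {       /* producer finished, buffers drained */
                pthread_mutex_unlock(&lock);
                break;
            }
            pthread_mutex_unlock(&lock);

            fwrite(bufs[i], 1, filled[i], stdout);  /* write the finished buffer */

            pthread_mutex_lock(&lock);
            ready[i] = 0;                  /* hand the buffer back */
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        pthread_join(t, NULL);
        return 0;
    }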

To be clear, my question isn't about how to use functions like malloc() and pthread_create() etc.; the sketch above is just context. My question is purely about how to choose the number of bytes (512, 1024, 1048576, ...) for the buffer size that will give the best performance.

Ideally, I'd like to find a way in which I could choose an optimal buffer size dynamically, so that my program could adjust to whatever hardware it was being run on at the time. I have tried searching for answers to this problem, and although I found a few threads about buffer size, I couldn't find anything particularly relevant to this problem. Therefore, I just wanted to post it as a question in the hope that I could get some different points of view, and come up with something better than I could on my own.
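
(One idea I had, assuming POSIX and only as a starting value, is to ask the OS for the preferred I/O block size of whatever stdout is connected to; st_blksize is just a hint, though:)

    /* Idea: use the preferred I/O block size of stdout's target as a
       starting value for the buffer size. POSIX-only sketch. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(void)
    {
        struct stat st;
        size_t bufsize = 1 << 16;               /* fallback if fstat fails */
        if (fstat(fileno(stdout), &st) == 0 && st.st_blksize > 0)
            bufsize = (size_t)st.st_blksize;    /* filesystem's preferred size */
        fprintf(stderr, "buffer size: %zu\n", bufsize);
        return 0;
    }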

Upvotes: 3

Views: 711

Answers (5)

Philipp Claßen

Reputation: 43979

Short answer: Measure it.

Long answer: From my experience, it depends too much on factors that are hard to predict in advance. On the other hand, you do not have to commit yourself before you start. Just implement a generic solution, and when you are done, run a few performance tests and take the settings with the best results. A profiler may help you concentrate on the performance-critical parts of your program.
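
For example, a measurement harness could be as simple as this sketch (POSIX; the sizes and the total amount are arbitrary, and partial writes are ignored for brevity):

    /* Sketch: push a fixed amount of data through write() with several
       buffer sizes and report the wall-clock time for each on stderr. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    #define TOTAL (64L * 1024 * 1024)           /* bytes written per trial */

    static double trial(size_t bufsize)
    {
        char *buf = malloc(bufsize);
        if (!buf) { perror("malloc"); exit(1); }
        memset(buf, 'x', bufsize);              /* dummy payload */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long written = 0; written < TOTAL; written += (long)bufsize)
            if (write(STDOUT_FILENO, buf, bufsize) < 0) { perror("write"); exit(1); }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        free(buf);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        static const size_t sizes[] = { 512, 4096, 65536, 1048576 };
        for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++)
            fprintf(stderr, "%8zu bytes: %.3f s\n", sizes[i], trial(sizes[i]));
        return 0;
    }

Run it with stdout redirected (to a file, or to /dev/null) and compare the timings printed on stderr.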

From what I've seen, those who produce the fastest code often try the simplest, most straightforward approach first. What they do better than average programmers is that they have a very good technique for writing meaningful performance tests, which is by no means trivial.

Without experience, it is easy to fall into certain traps, for example, ignoring caching effects or (maybe in your application?!) underestimating the cost of IO operations. In the worst case, you end up squeezing cycles out of parts of the program that do not contribute to the overall performance at all.

Back to your original question:

In the scenario that you describe (one CPU-bound producer and one IO-bound consumer), it is likely that one of them will be the bottleneck (unless the rate at which the producer generates data varies a lot). Depending on which one is faster, the whole situation changes radically:

Let us first assume, the IO-bound consumer is your bottleneck (doesn't matter whether it writes to stdout or to a file). What are the likely consequences?

Optimizing the algorithm that produces the data will not improve performance; instead, you have to maximize the write performance. I would assume, however, that the write performance will not depend very much on the buffer size (unless the buffer is too small).

In the other case, if the producer is the limiting factor, the situation is reversed. Here you have to profile the generation code, improve the speed of the algorithm, and maybe the communication of the data between the producer and the writer thread. The buffer size, however, will still not matter, as the buffer will be empty most of the time anyway.

Granted, the situation could be more complex than I have described. But unless you are actually sure that you are not in one of the extreme cases, I would not invest in tuning the buffer size yet. Just keep it configurable and you should be fine. I don't think that it should be a problem later to refit it to other hardware environments.
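
"Keep it configurable" can be as cheap as one setvbuf() call with a size taken from the command line (a sketch; the argument handling is purely illustrative):

    /* Take the stdio buffer size from argv, so retuning on new hardware
       is a command-line change rather than a rebuild. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        size_t bufsize = (argc > 1) ? (size_t)strtoul(argv[1], NULL, 10) : 65536;
        if (setvbuf(stdout, NULL, _IOFBF, bufsize) != 0) {  /* before any output */
            fprintf(stderr, "setvbuf failed\n");
            return 1;
        }
        for (int i = 0; i < 1000000; i++)   /* placeholder for the real producer */
            printf("%d\n", rand());
        return 0;
    }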

Upvotes: 1

Mats Petersson

Reputation: 129374

I personally think you are wasting your time.

First, run:

    time ./myprog > /dev/null

Now, use:

    time dd if=/dev/zero of=myfile.data bs=1k count=12M

dd is about as simple a program as you can get, and it will write the file pretty quickly. But writing several gigabytes still takes a little while (12G takes about 4 minutes on my machine - which probably doesn't have the fastest disk in the world - while the same size to /dev/null takes about 5 seconds).

You can experiment with different numbers for bs=x count=y, where the combination makes the same size as your program's output for the test run; see the examples below. I found that if you make VERY large blocks, it actually takes longer (1MB per write - probably because the OS needs to copy 1MB before it can write the data, write it out, and then copy the next 1MB), whereas with smaller blocks (I tested 1k and 4k) it takes a lot less time to copy the data, and there's actually less "disk spinning round not doing anything before we write to it".
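
For instance, these all write the same 12G total, just with different block sizes (GNU dd; adjust the total to match your program's output size):

    time dd if=/dev/zero of=myfile.data bs=1k count=12M
    time dd if=/dev/zero of=myfile.data bs=4k count=3M
    time dd if=/dev/zero of=myfile.data bs=1M count=12k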

Compare both of these times to your program's running time. Is the time it takes to write the file with dd much shorter than the time your program takes writing to the file?

If there isn't much difference, then look at the time it takes to write to /dev/null with your program - is that accounting for some or all of the difference?

Upvotes: 1

0xJarno

Reputation: 31

There's no need to use buffering; the OS will automatically swap pages to the disk for you whenever necessary, so you don't have to program that. The simplest thing would be for you to leave it in RAM if you don't need to save the data; otherwise, you're probably better off saving it after generating all the data, because that's better for the disk I/O.

Upvotes: 0

DigitalRoss

Reputation: 146073

It's a big waste of time to mix design and optimization. This is considered one of the top canonical mistakes. It is likely to damage your design and not actually optimize much.

Get your program working, and if there is an indication of a performance issue, then profile it and consider analyzing the part that's really causing the problem.

I would think this applies particularly to a complex architectural optimization like multithreading your application. Multithreading a single image is something you never really want to do: it's impossible to test, prone to unreproducible bugs, and it will fail differently in different execution environments, among other problems. But for some programs, multithreaded parallel execution is required for functionality or is one way to get the necessary performance. It's widely supported, and essentially it is, at times, a necessary evil.

It's not something you want in the initial design without solid evidence that programs like yours need it.

Almost any other method of parallelism (message passing?) will be easier to implement and debug, and you are getting lots of that in the I/O system of your OS anyway.

Upvotes: 6

aagdbl

Reputation: 99

Most modern OSes are good at using the disk as a backing store for RAM. I suggest you leave the heuristics to the OS and just ask for as much memory as you want, till you hit a performance bottleneck.

Upvotes: 0
