Reputation: 3099
My task is very simple: Read and parse a large file in C++ on Linux. There are two ways:
Parse byte by byte.
int c;
while ((c = fgetc(fp)) != EOF) {
    /* do something with the char */
}
Parse buffer by buffer.
for (;;) {
    char buffer[SOME_LARGE_NUMBER];
    size_t n = fread(buffer, 1, SOME_LARGE_NUMBER, fp);
    if (n == 0) break;  /* EOF or read error */
    /* parse the first n bytes of the buffer */
}
Now, parsing byte by byte is easier for me (no check for how full the buffer is, etc.). However, I heard that reading large pieces is more efficient.
What is the philosophy? Is "optimal" buffering a task of the kernel, so the data is already buffered when I call fgetc()? Or am I supposed to handle it myself to gain the best efficiency?
Also, apart from all philosophy: What's the reality on Linux here?
Upvotes: 11
Views: 663
Reputation: 16612
Regardless of the performance or underlying buffering of fgetc(), calling a function for every single byte you require, versus having a decent-sized buffer to iterate over, is overhead that the kernel cannot help you with.
I did some quick and dirty timings for my local system (obviously YMMV). I chose a ~200k file and summed each byte. I did this 20000 times, alternating every 1000 cycles between reading using fgetc() and reading using fread(), and timed each 1000 cycles as a single lump. I compiled a release build with optimisations turned on.
The fgetc() loop variant was consistently 45x slower than the fread() loop.
After prompting in the comments, I also compared getc() and tried varying the stdio buffer size. There were no noticeable changes in performance.
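For reference, here is a minimal sketch of this kind of timing harness (not the exact code behind the numbers above; the file name "data.bin", the 64 KB buffer, and the run count are placeholder choices):

#include <stdio.h>
#include <chrono>

/* Sum every byte with fgetc(). */
static unsigned long sum_fgetc(const char *path) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return 0;
    unsigned long sum = 0;
    int c;
    while ((c = fgetc(fp)) != EOF)
        sum += c;
    fclose(fp);
    return sum;
}

/* Sum every byte with fread() into a 64 KB buffer. */
static unsigned long sum_fread(const char *path) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return 0;
    unsigned long sum = 0;
    char buffer[64 * 1024];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, fp)) > 0)
        for (size_t i = 0; i < n; ++i)
            sum += (unsigned char)buffer[i];
    fclose(fp);
    return sum;
}

int main(void) {
    using Clock = std::chrono::steady_clock;
    const char *path = "data.bin";   /* placeholder test file */
    const int runs = 1000;

    auto t0 = Clock::now();
    unsigned long s1 = 0;
    for (int i = 0; i < runs; ++i) s1 += sum_fgetc(path);
    auto t1 = Clock::now();
    unsigned long s2 = 0;
    for (int i = 0; i < runs; ++i) s2 += sum_fread(path);
    auto t2 = Clock::now();

    auto ms = [](Clock::time_point a, Clock::time_point b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    printf("fgetc: %lld ms (sum %lu)\n", (long long)ms(t0, t1), s1);
    printf("fread: %lld ms (sum %lu)\n", (long long)ms(t1, t2), s2);
    return 0;
}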
Upvotes: 11
Reputation: 20057
The reason for the slowness of fgetc is not the number of function calls, but the number of system calls. fgetc is often implemented as int fgetc(FILE *fp) { unsigned char ch; return (fread(&ch,1,1,fp)==1 ? ch : EOF); }
Even though fread itself may buffer 64k or 1k, the system call overhead makes the difference compared to e.g.
int fgetc_buffered(FILE *fp) {
    static int head=0, tail=0;            /* bytes buffered / bytes consumed */
    static unsigned char buffer[1024];
    if (head>tail) return buffer[tail++]; /* serve from the local buffer */
    tail=0; head=fread(buffer,1,sizeof buffer,fp); /* refill with one call */
    if (head<=0) return EOF;              /* end of file or read error */
    return buffer[tail++];
}
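A sketch of how this variant might be driven; note that the static state means it can only serve one stream at a time, and "data.bin" is a placeholder path:

#include <stdio.h>

/* fgetc_buffered() as defined above */

int main(void) {
    FILE *fp = fopen("data.bin", "rb");   /* placeholder path */
    if (!fp) return 1;
    unsigned long sum = 0;
    int c;
    while ((c = fgetc_buffered(fp)) != EOF)
        sum += c;                          /* same per-byte loop, far fewer fread calls */
    fclose(fp);
    printf("sum = %lu\n", sum);
    return 0;
}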
Upvotes: 1
Reputation: 2862
The stdio routines do user-space buffering. When you call getc, fgetc, or fread, they fetch data from the stdio user-space buffer. When the buffer is empty, stdio will use the kernel's read call to get more data.
The people who design file systems know that disk accesses (mainly seeks) are very expensive. So even if stdio uses a 512-byte block size, a file system might use a 4 KB block size, and the kernel will read the file 4 KB at a time.
Usually the kernel will initiate the disk or network request as soon as it gets a read. For a disk, if it sees you reading the file sequentially, it will start reading ahead (fetching blocks before you ask for them) so the data is available sooner.
Also, the kernel will cache files in memory. So if the file you are reading fits in memory, after one run of your program the file will stay in memory until the kernel decides it is better to cache some other files you are referencing.
Using mmap, you may not get the same benefit from the kernel's sequential read-ahead heuristics; madvise(MADV_SEQUENTIAL) can be used to ask for read-ahead on a mapping explicitly.
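If you stay with read()/stdio, here is a minimal sketch of explicitly hinting sequential access with posix_fadvise(2); the path is a placeholder, and in practice the kernel usually detects sequential reads on its own:

#include <stdio.h>
#include <fcntl.h>    /* posix_fadvise, POSIX_FADV_SEQUENTIAL */

int main(void) {
    FILE *fp = fopen("data.bin", "rb");   /* placeholder path */
    if (!fp) return 1;
    /* Tell the kernel we will read sequentially so it can read
       ahead more aggressively; len == 0 means "to end of file". */
    posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);
    /* ... read the file with fread()/fgetc() as usual ... */
    fclose(fp);
    return 0;
}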
Upvotes: 0
Reputation: 180235
Doesn't matter, really. Even from SSDs, the I/O overhead dwarfs the time spent in buffering. Sure, it's now microseconds instead of milliseconds, but function calls are measured in nanoseconds.
Upvotes: 1
Reputation: 60037
The stdio buffer is not part of the kernel; it is part of user space.
However, you can affect the size of that buffer using setvbuf. When that buffer runs empty, the stdio library will refill it by issuing the read system call.
So whether you use fgetc or fread makes no difference in terms of switching between kernel and user space.
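A minimal sketch of enlarging the stdio buffer with setvbuf; the 1 MB size and the file name are arbitrary examples:

#include <stdio.h>

int main(void) {
    FILE *fp = fopen("data.bin", "rb");   /* placeholder path */
    if (!fp) return 1;
    /* Install a 1 MB fully buffered stdio buffer; this must be done
       before the first read on the stream. */
    static char big[1 << 20];
    if (setvbuf(fp, big, _IOFBF, sizeof big) != 0)
        return 1;
    /* ... fgetc()/fread() now refill from this larger buffer ... */
    fclose(fp);
    return 0;
}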
Upvotes: 3