Reputation: 4264
So (just for fun), I was trying to write a C program to copy a file. I read around and it seems that all the functions that read from a stream call fgetc() (I hope this is true?), so I used that function:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define FILEr "img1.png"
#define FILEw "img2.png"
int main(void)
{
    clock_t start, diff;
    int msec, c;
    FILE *fr, *fw;
    fr = fopen(FILEr, "rb");   /* binary mode, since the file is a PNG */
    fw = fopen(FILEw, "wb");
    start = clock();
    /* loop on the fgetc() return value: testing feof() before reading
       would also write one junk byte (EOF) at the end */
    while ((c = fgetc(fr)) != EOF)
        fputc(c, fw);
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds\n", msec / 1000, msec % 1000);
    fclose(fr);
    fclose(fw);
    return 0;
}
This gave a run time of 140 ms for this file on a 2.10 GHz Core 2 Duo T6500 Dell Inspiron laptop.
However, when I try using fread/fwrite, the run time decreases as I keep increasing the number of bytes (i.e. the variable st in the following code) transferred per call, until it bottoms out at around 10 ms! Here is the code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define FILEr "img1.png"
#define FILEw "img2.png"
int main(void)
{
    clock_t start, diff;
    /* number of bytes copied at each step */
    size_t st = 10000, n;
    int msec;
    FILE *fr, *fw;
    /* buffer for the data that is read */
    char *x = malloc(st);
    fr = fopen(FILEr, "rb");
    fw = fopen(FILEw, "wb");
    start = clock();
    /* write only the n bytes actually read, so the final partial
       chunk does not pad the output file */
    while ((n = fread(x, 1, st, fr)) > 0)
        fwrite(x, 1, n, fw);
    diff = clock() - start;
    msec = diff * 1000 / CLOCKS_PER_SEC;
    printf("Time taken %d seconds %d milliseconds\n", msec / 1000, msec % 1000);
    fclose(fr);
    fclose(fw);
    free(x);
    return 0;
}
Why is this happening? I.e., if fread is effectively multiple calls to fgetc, then why the speed difference?
EDIT: clarified that "increasing the number of bytes" refers to the variable st in the second code.
Upvotes: 9
Views: 15325
Reputation: 393809
You are forgetting about file buffering (inode, dentry and page caches).
Clear them before you run:
echo 3 > /proc/sys/vm/drop_caches
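If you'd rather drop the caches from inside the benchmark harness than from a shell, here is a minimal sketch, assuming Linux and root privileges (it does the same thing as the echo above):

/* Minimal sketch, assuming Linux and root: drop the page, dentry
   and inode caches before timing a run. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    sync();  /* flush dirty pages first, or they cannot be dropped */
    FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
    if (!f) {
        perror("drop_caches");
        return EXIT_FAILURE;
    }
    fputs("3\n", f);  /* 3 = free page cache + dentries + inodes */
    fclose(f);
    return 0;
}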
Benchmarking is an art. Refer to bonnie++, iozone and phoronix for proper filesystem benchmarking. As a characteristic, bonnie++ won't allow a benchmark with a written volume of less than 2x the available system memory.
Why? (Answer: buffering effects!)
Upvotes: 9
Reputation: 4495
The stdio functions fill a read buffer of size BUFSIZ (as defined in stdio.h) and make only one read(2) system call each time that buffer is drained. They do not make an individual read(2) system call for every byte consumed; they read large chunks. BUFSIZ is typically something like 1024 or 4096.
You can also adjust that buffer's size, if you wish, to increase it -- see the man pages for setbuf/setvbuf/setbuffer on most systems -- though that is unlikely to make a huge difference in performance.
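For example, here is a sketch of enlarging the read stream's buffer with setvbuf() (the 64 KiB size is an arbitrary choice for illustration; the file name is taken from the question):

/* Sketch: give the stream a larger stdio buffer with setvbuf().
   It must be called after fopen() but before any I/O on the stream. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fr = fopen("img1.png", "rb");
    if (!fr) { perror("fopen"); return EXIT_FAILURE; }

    static char buf[64 * 1024];   /* must outlive the stream */
    if (setvbuf(fr, buf, _IOFBF, sizeof buf) != 0)
        fprintf(stderr, "setvbuf failed\n");

    /* ... copy loop as in the question ... */
    fclose(fr);
    return 0;
}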
On the other hand, as you note, you can make a read(2) system call of arbitrary size by setting that size in the call, though you get diminishing returns with that at some point.
BTW, you might as well use open(2) and not fopen(3) if you are doing things this way. There is little point in fopen'ing a file you are only going to use for its file descriptor.
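For reference, a sketch of that unbuffered route (POSIX, not standard C; file names taken from the question, error handling kept minimal):

/* Sketch: copy with open(2)/read(2)/write(2) and a caller-chosen
   block size, bypassing stdio buffering entirely. */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int in  = open("img1.png", O_RDONLY);
    int out = open("img2.png", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0) { perror("open"); return EXIT_FAILURE; }

    char buf[65536];               /* one read(2) per 64 KiB */
    ssize_t n;
    while ((n = read(in, buf, sizeof buf)) > 0)
        if (write(out, buf, (size_t)n) != n) { perror("write"); break; }

    close(in);
    close(out);
    return 0;
}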
Upvotes: 3
Reputation: 1202
Like sehe says its partly because buffering, but there is more to it and I'll explain why is that and at the same why fgetc()
will give more latency.
fgetc()
is called for every byte that is read from from file.
fread()
is called for every n bytes of the local buffer for file data.
So for a 10 MiB file:
fgetc() is called 10,485,760 times,
while fread() with a 1 KiB buffer is called 10,240 times.
Let's say for simplicity that every function call takes 1 ms:
fgetc would take 10,485,760 ms = 10,485.76 seconds, about 2.91 hours
fread would take 10,240 ms = 10.24 seconds
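A tiny sketch that just reproduces the call-count arithmetic above (the file and chunk sizes are the illustrative values, not measurements):

/* Sketch: call counts for a 10 MiB file, byte-at-a-time vs. 1 KiB chunks. */
#include <stdio.h>

int main(void)
{
    const long file_size = 10L * 1024 * 1024;  /* 10 MiB */
    const long chunk     = 1024;               /* 1 KiB fread buffer */

    printf("fgetc calls: %ld\n", file_size);                       /* 10485760 */
    printf("fread calls: %ld\n", (file_size + chunk - 1) / chunk); /* 10240 */
    return 0;
}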
On top of that, the OS usually does the reading and writing on the same device; I suppose your example does it on the same hard disk. When reading your source file, the OS moves the hard disk heads over the spinning platters to seek to the file, reads 1 byte into memory, then moves the read/write head again to the place where the OS and the disk controller agreed to put the destination file, and writes 1 byte from memory. For the above example this happens over 10 million times for each file, over 20 million times in total; with the buffered version it happens a grand total of just over 20,000 times.
Besides that, when reading from the disk the OS puts a few extra KiB of disk data into memory for performance reasons, and this can speed up the program even with the less efficient fgetc, because the program reads from the OS's memory instead of directly from the hard disk. This is what sehe's answer refers to.
Depending on your machine configuration/load/OS/etc., your results from reading and writing can vary a lot, hence his recommendation to empty the disk caches to get more meaningful results.
When the source and destination files are on different hard disks, things are a lot faster. With SSDs, I'm not really sure whether reads and writes are mutually exclusive.
Summary: every call to a function has a certain overhead, reading from a HDD has other overheads, and caches/buffers help to speed things up.
Upvotes: 4
Reputation: 11906
fread() is not calling fgetc() to read each byte.
It behaves as if it were calling fgetc() repeatedly, but it has direct access to the buffer that fgetc() reads from, so it can directly copy a larger quantity of data.
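To illustrate the difference, here is a hypothetical byte-at-a-time fread() built on fgetc(). This is a sketch, not the real library code; a real implementation can instead copy whole runs straight out of the stream's internal buffer with memcpy():

#include <stdio.h>

/* Hypothetical sketch: an fread() that really did call fgetc() per byte.
   Semantically equivalent, but it pays one function call per byte. */
size_t fread_naive(void *ptr, size_t size, size_t nmemb, FILE *fp)
{
    unsigned char *out = ptr;
    size_t total = size * nmemb, i;
    for (i = 0; i < total; i++) {
        int c = fgetc(fp);
        if (c == EOF)
            break;
        out[i] = (unsigned char)c;
    }
    return i / size;  /* number of complete items read, like the real fread() */
}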
Upvotes: 22