Reputation: 2992
I have a program that saves many large files (>1 GB) using fwrite. It works fine, but unfortunately, due to the nature of the data, each call to fwrite only writes 1-4 bytes. As a result, the write can take over an hour, with most of this time seemingly due to syscall overhead (or at least overhead inside the fwrite library function). I have a similar problem with fread.
Does anyone know of any existing library functions that will buffer these writes and reads behind an inline function, or is this another roll-your-own?
Upvotes: 9
Views: 17063
Reputation: 11
Below is evidence that fwrite() incurs significant overhead for a massive number of small writes, even with a large buffer (400 MB). I tested the code on Ubuntu with an SSD. The running time of each function is given in its comment.
So the solution to your question is either 1) gather all your small pieces of data into one big memory block and write it to disk with a single fwrite call, or 2) implement your own buffered reader and writer.
(You may also want to refer to a related question.)
#include<bits/stdc++.h>
using namespace std;
char filename[64] = "test.intarr";
void test_write_fstream(const vector<int>& v) {//11.395 s
ofstream out(filename, ios::binary);
out.write((char*)v.data(), v.size()*sizeof(int));
out.close();
}
void test_write_fstream_in_loop(const vector<int>& v) {// 42.284 s
ofstream out(filename, ios::binary);
for (size_t i = 0; i < v.size(); i++)
{
out.write((char*)&v[i], sizeof(int));
}
out.close();
}
void test_write_fwrite(const vector<int>& v) {// 11.466s
FILE* out = fopen(filename, "wb");
fwrite(v.data(), sizeof(int), v.size(), out);
fclose(out);
}
void test_write_fwrite_loop(const vector<int>& v) {// 59.338s
FILE* out = fopen(filename, "wb");
for (size_t i = 0; i < v.size(); i++)
{
fwrite(&v[i], sizeof(int), 1, out);
}
fclose(out);
}
void test_write_fwrite_unlocked_loop(const vector<int>& v) {//33.676 s
FILE* out = fopen(filename, "wb");
for (size_t i = 0; i < v.size(); i++)
{
fwrite_unlocked(&v[i], sizeof(int), 1, out);
}
fclose(out);
}
void test_write_fwrite_unlocked_loop_buffered(const vector<int>& v) {//28.198 s (400M buffer)
char * buffer = (char*)malloc(400*1024*1024);
FILE* out = fopen(filename, "wb");
setvbuf(out, buffer, _IOFBF, 400*1024*1024);
for (size_t i = 0; i < v.size(); i++)
{
fwrite_unlocked(&v[i], sizeof(int), 1, out);
}
fclose(out);
free(buffer);
}
void test_write_fwrite_loop_with_4M_buffer(const vector<int>& v) {// 53.229 (4M buffer), 52.537 (400M buffer)
char *buffer = (char*)malloc(400*1024*1024);
FILE* out = fopen(filename, "wb");
// char buffer[400*1024*1024];
// buffer size as written is 400M; use 4*1024*1024 here to reproduce the 4M timing
setvbuf(out, buffer, _IOFBF, 400*1024*1024);
for (size_t i = 0; i < v.size(); i++)
{
fwrite(&v[i], sizeof(int), 1, out);
}
fclose(out);
free(buffer);
}
int main() {
vector<int> v;
auto start = chrono::high_resolution_clock::now();
v.resize(1000*1000*1000);
for (size_t i = 0; i < v.size(); i++)
{// 36130 ms
v[i] = rand();
}
auto end = chrono::high_resolution_clock::now();
cout << "Generate time: " << chrono::duration_cast<chrono::milliseconds>(end-start).count() << "ms" << endl;
//get current time in milliseconds
start = chrono::high_resolution_clock::now();
test_write_fwrite_unlocked_loop(v);
end = chrono::high_resolution_clock::now();
cout << "Write time: " << chrono::duration_cast<chrono::milliseconds>(end-start).count() << "ms" << endl;
return 0;
}
Upvotes: 1
Reputation: 2301
If you write from just one thread, try using fwrite_unlocked. It does wonders relative to plain fwrite in this kind of scenario.
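A minimal sketch of that suggestion, assuming glibc (where fwrite_unlocked is available as an extension) and falling back to plain fwrite elsewhere. The helper name write_values_locked_once is hypothetical. It takes the stream lock once with the POSIX flockfile call, so the loop stays safe even if another thread touches the same stream:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: writes `count` 4-byte values one at a time, as in the
// question, but acquires the stream lock only once for the whole loop.
// fwrite_unlocked is a glibc extension; other platforms use plain fwrite here.
// Returns the number of values actually written.
std::size_t write_values_locked_once(std::FILE* out,
                                     const std::uint32_t* values,
                                     std::size_t count) {
    assert(out != nullptr);
    std::size_t written = 0;
#if defined(__GLIBC__)
    flockfile(out);  // POSIX: lock the stream once
    for (std::size_t i = 0; i < count; ++i)
        written += fwrite_unlocked(&values[i], sizeof values[i], 1, out);
    funlockfile(out);
#else
    for (std::size_t i = 0; i < count; ++i)
        written += std::fwrite(&values[i], sizeof values[i], 1, out);
#endif
    return written;
}
```

Whether this helps enough depends on the platform; it removes the per-call locking but not the rest of the per-call work, so it is worth timing against your own data.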
Upvotes: 1
Reputation: 3621
Here's a test in Nim showing that fwrite introduces function call overhead, and that batching on your end decreases clock time. As batchPow increases from 0 to 10, clock time decreases from 36 seconds to 4 seconds:
nim r -d:case1 -d:danger --gc:arc main.nim | wc -l
36 seconds
nim r -d:case2 -d:danger --gc:arc -d:batchPow:10 main.nim | wc -l
4 seconds
Even LTO won't help with fwrite's function call overhead as you can see with -d:case1 --passc:-flto --passl:-flto
var buf: string
let n = 1000_000_000
for i in 0..<n:
  let c = cast[char](i)
  when defined case1: # 36 seconds
    stdout.write c
  when defined case2: # 4 seconds
    const batchPow {.intdefine.} = 10
    buf.add c
    if ((i and (2 shl batchPow - 1)) == 0) or (i == n-1):
      stdout.write buf
      buf.setLen 0
Upvotes: 1
Reputation: 71060
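OK, well, that was interesting. I thought I'd write some actual code to see what the speed was, and here it is, compiled using C++ DevStudio 2010 Express. There's quite a bit of code here. It times five ways of writing the data: many small fwrite calls; small writes batched in a buffer that is flushed with fwrite; many small Win32 WriteFile calls; small writes batched in a buffer that is flushed with WriteFile; and small writes batched in a double buffer that is flushed with asynchronous (overlapped) WriteFile.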
Please check that I've not done something a bit stupid with any of the above.
The program uses QueryPerformanceCounter for timing the code and ends the timing after the file has been closed to try and include any pending internal buffered data.
The results on my machine (an old WinXP SP3 box):-
You may get different results depending on your setup.
Feel free to edit and improve the code.
#define _CRT_SECURE_NO_WARNINGS
#include <stdio.h>
#include <memory.h>
#include <Windows.h>
const int
// how many times fwrite/my_fwrite is called
c_iterations = 10000000,
// the size of the buffer used by my_fwrite
c_buffer_size = 100000;
char
buffer1 [c_buffer_size],
buffer2 [c_buffer_size],
*current_buffer = buffer1;
int
write_ptr = 0;
__int64
write_offset = 0;
OVERLAPPED
overlapped = {0};
// write to a buffer, when buffer full, write the buffer to the file using fwrite
void my_fwrite (void *ptr, int size, int count, FILE *fp)
{
const int
c = size * count;
if (write_ptr + c > c_buffer_size)
{
fwrite (buffer1, write_ptr, 1, fp);
write_ptr = 0;
}
memcpy (&buffer1 [write_ptr], ptr, c);
write_ptr += c;
}
// write to a buffer, when buffer full, write the buffer to the file using Win32 WriteFile
void my_fwrite (void *ptr, int size, int count, HANDLE fp)
{
const int
c = size * count;
if (write_ptr + c > c_buffer_size)
{
DWORD
written;
WriteFile (fp, buffer1, write_ptr, &written, 0);
write_ptr = 0;
}
memcpy (&buffer1 [write_ptr], ptr, c);
write_ptr += c;
}
// write to a double buffer, when buffer full, write the buffer to the file using
// asynchronous WriteFile (waiting for previous write to complete)
void my_fwrite (void *ptr, int size, int count, HANDLE fp, HANDLE wait)
{
const int
c = size * count;
if (write_ptr + c > c_buffer_size)
{
WaitForSingleObject (wait, INFINITE);
overlapped.Offset = write_offset & 0xffffffff;
overlapped.OffsetHigh = write_offset >> 32;
overlapped.hEvent = wait;
WriteFile (fp, current_buffer, write_ptr, 0, &overlapped);
write_offset += write_ptr;
write_ptr = 0;
current_buffer = current_buffer == buffer1 ? buffer2 : buffer1;
}
memcpy (current_buffer + write_ptr, ptr, c);
write_ptr += c;
}
int main ()
{
// do lots of little writes
FILE
*f1 = fopen ("f1.bin", "wb");
LARGE_INTEGER
f1_start,
f1_end;
QueryPerformanceCounter (&f1_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
fwrite (&i, sizeof i, 1, f1);
}
fclose (f1);
QueryPerformanceCounter (&f1_end);
// do a few big writes
FILE
*f2 = fopen ("f2.bin", "wb");
LARGE_INTEGER
f2_start,
f2_end;
QueryPerformanceCounter (&f2_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
my_fwrite (&i, sizeof i, 1, f2);
}
if (write_ptr)
{
fwrite (buffer1, write_ptr, 1, f2);
write_ptr = 0;
}
fclose (f2);
QueryPerformanceCounter (&f2_end);
// use Win32 API, without buffer
HANDLE
f3 = CreateFile (TEXT ("f3.bin"), GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, 0);
LARGE_INTEGER
f3_start,
f3_end;
QueryPerformanceCounter (&f3_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
DWORD
written;
WriteFile (f3, &i, sizeof i, &written, 0);
}
CloseHandle (f3);
QueryPerformanceCounter (&f3_end);
// use Win32 API, with buffer
HANDLE
f4 = CreateFile (TEXT ("f4.bin"), GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_FLAG_WRITE_THROUGH, 0);
LARGE_INTEGER
f4_start,
f4_end;
QueryPerformanceCounter (&f4_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
my_fwrite (&i, sizeof i, 1, f4);
}
if (write_ptr)
{
DWORD
written;
WriteFile (f4, buffer1, write_ptr, &written, 0);
write_ptr = 0;
}
CloseHandle (f4);
QueryPerformanceCounter (&f4_end);
// use Win32 API, with double buffering
HANDLE
f5 = CreateFile (TEXT ("f5.bin"), GENERIC_WRITE, 0, 0, CREATE_ALWAYS, FILE_FLAG_OVERLAPPED | FILE_FLAG_WRITE_THROUGH, 0),
wait = CreateEvent (0, false, true, 0);
LARGE_INTEGER
f5_start,
f5_end;
QueryPerformanceCounter (&f5_start);
for (int i = 0 ; i < c_iterations ; ++i)
{
my_fwrite (&i, sizeof i, 1, f5, wait);
}
if (write_ptr)
{
WaitForSingleObject (wait, INFINITE);
overlapped.Offset = write_offset & 0xffffffff;
overlapped.OffsetHigh = write_offset >> 32;
overlapped.hEvent = wait;
WriteFile (f5, current_buffer, write_ptr, 0, &overlapped);
WaitForSingleObject (wait, INFINITE);
write_ptr = 0;
}
CloseHandle (f5);
QueryPerformanceCounter (&f5_end);
CloseHandle (wait);
LARGE_INTEGER
freq;
QueryPerformanceFrequency (&freq);
// each quotient is a 64-bit value, so cast it to int to match the %d format
printf (" fwrites without buffering = %dms\n", (int) ((1000 * (f1_end.QuadPart - f1_start.QuadPart)) / freq.QuadPart));
printf (" fwrites with buffering = %dms\n", (int) ((1000 * (f2_end.QuadPart - f2_start.QuadPart)) / freq.QuadPart));
printf (" Win32 without buffering = %dms\n", (int) ((1000 * (f3_end.QuadPart - f3_start.QuadPart)) / freq.QuadPart));
printf (" Win32 with buffering = %dms\n", (int) ((1000 * (f4_end.QuadPart - f4_start.QuadPart)) / freq.QuadPart));
printf ("Win32 with double buffering = %dms\n", (int) ((1000 * (f5_end.QuadPart - f5_start.QuadPart)) / freq.QuadPart));
}
Upvotes: 4
Reputation: 1357
First and foremost: small fwrites() are slower, because each fwrite has to test the validity of its parameters, do the equivalent of flockfile(), possibly fflush(), append the data, return success: this overhead adds up -- not so much as tiny calls to write(2), but it's still noticeable.
Proof:
#include <stdio.h>
#include <stdlib.h>
static void w(const void *buf, size_t nbytes)
{
size_t n;
if(!nbytes)
return;
n = fwrite(buf, 1, nbytes, stdout);
if(n >= nbytes)
return;
if(!n) {
perror("stdout");
exit(111);
}
/* pointer arithmetic on void * is a compiler extension, so go through char * */
w((const char *)buf + n, nbytes - n);
}
/* Usage: time $0 [chunk_size] <$bigfile >/dev/null */
int main(int argc, char *argv[])
{
char buf[32*1024];
size_t sz;
/* a missing or zero chunk size means "use the whole buffer" */
sz = argc > 1 ? atoi(argv[1]) : 0;
if(sz > sizeof(buf))
return 111;
if(sz == 0)
sz = sizeof(buf);
for(;;) {
size_t r = fread(buf, 1, sz, stdin);
if(r < 1)
break;
w(buf, r);
}
return 0;
}
That being said, you could do what many commenters suggested, i.e. add your own buffering in front of fwrite. It's trivial code, but you should test whether it really gives you any benefit.
If you don't want to roll your own, you can use e.g. the buffer interface in skalibs, but you'll probably take longer to read the docs than to write it yourself (IMHO).
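The question also mentions fread, and the same add-your-own-buffering idea can be mirrored on the read side. The following is only a sketch under assumptions: the class name SmallReadBuffer and the default 1 MiB capacity are made up for illustration. It refills a large block with one fread and serves the small reads from memory:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical read-side buffer: one large fread per refill, many small
// reads served by memcpy from the in-memory block.
class SmallReadBuffer {
public:
    explicit SmallReadBuffer(std::FILE* in, std::size_t capacity = 1 << 20)
        : in_(in), buf_(capacity), pos_(0), end_(0) {
        assert(in != nullptr && capacity > 0);
    }

    // Copies up to n bytes into dst. Returns the number of bytes copied,
    // which is smaller than n only at end of file or on a read error.
    std::size_t read(void* dst, std::size_t n) {
        std::size_t done = 0;
        while (done < n) {
            if (pos_ == end_) {  // buffer exhausted: refill with one fread
                end_ = std::fread(buf_.data(), 1, buf_.size(), in_);
                pos_ = 0;
                if (end_ == 0) break;  // EOF or error
            }
            std::size_t chunk = n - done;
            if (chunk > end_ - pos_) chunk = end_ - pos_;
            std::memcpy(static_cast<char*>(dst) + done, buf_.data() + pos_, chunk);
            pos_ += chunk;
            done += chunk;
        }
        return done;
    }

private:
    std::FILE* in_;
    std::vector<char> buf_;
    std::size_t pos_;  // next unread byte in buf_
    std::size_t end_;  // number of valid bytes in buf_
};
```

Because read is defined inside the class body it is implicitly inline, so the common case is a bounds check plus a small memcpy. As with the write side, measure before assuming it beats plain fread.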
Upvotes: 1
Reputation: 500167
First of all, fwrite() is a library function and not a system call. Secondly, it already buffers the data.
You might want to experiment with increasing the size of the buffer. This is done by using setvbuf(). On my system this only helps a tiny bit, but YMMV.
If setvbuf() does not help, you could do your own buffering and only call fwrite() once you've accumulated enough data. This involves more work, but will almost certainly speed up the writing, as your own buffering can be made much more lightweight than fwrite()'s.
Edit: If anyone tells you that it's the sheer number of fwrite() calls that is the problem, demand to see evidence. Better still, do your own performance tests. On my computer, 500,000,000 two-byte writes using fwrite() take 11 seconds. This equates to a throughput of about 90 MB/s.
Last but not least, the huge discrepancy between 11 seconds in my test and the one hour mentioned in your question hints at the possibility that something else in your code is causing the very poor performance.
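For reference, the setvbuf() experiment mentioned above could look roughly like this. The helper name open_with_big_buffer is hypothetical and the buffer size is whatever you want to try; passing a null buffer asks the library to allocate one of the requested size itself:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: opens `path` for binary writing and requests a fully
// buffered stream with a buffer of `buffer_bytes` bytes.
// Returns nullptr if the file cannot be opened or the request is rejected.
std::FILE* open_with_big_buffer(const char* path, std::size_t buffer_bytes) {
    assert(buffer_bytes > 0);
    std::FILE* out = std::fopen(path, "wb");
    if (out == nullptr)
        return nullptr;
    // setvbuf has to be called before any other operation on the stream.
    if (std::setvbuf(out, nullptr, _IOFBF, buffer_bytes) != 0) {
        std::fclose(out);
        return nullptr;
    }
    return out;
}
```

A call such as open_with_big_buffer("out.bin", 4 * 1024 * 1024) then behaves like a normal FILE* for fwrite and fclose. A larger buffer reduces the number of system calls but does not remove the per-call cost of fwrite itself, which matches the "only helps a tiny bit" observation.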
Upvotes: 18
Reputation: 511
It should be easy to roll your own buffer, but fortunately standard C++ has what you are asking for. Just use std::ofstream:
// open and init
char mybuffer [1024];
std::ofstream filestr;
// set the buffer before opening the file: some implementations
// (e.g. libstdc++) ignore pubsetbuf() on a stream that is already open
filestr.rdbuf()->pubsetbuf(mybuffer,1024);
filestr.open("yourfile");
// write your data
filestr.write(data,datasize);
Edited: my mistake, use ofstream and not fstream, as it's not clear from the standard which buffer it is (input or output?).
Upvotes: -1
Reputation: 14205
The point of the FILE * layer in stdio is that it does the buffering for you. This saves you from system call overhead. As noted by others, one thing that could still be an issue is the library call overhead, which is considerably smaller. Another thing that might bite you is writing to lots of different locations on disk at the same time. (Disks spin, and the head takes ballpark 8ms to get to the right place for a random write.)
If you determine that library call overhead is the problem, I'd recommend rolling your own trivial buffering using vectors and periodically flushing the vectors to the files.
If the problem is that you have lots of writes dispersed all over the disk, try jacking up the buffer sizes using setvbuf(). Try a number around 4MB per file if you can.
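The vector-based buffering suggested above could be sketched as follows. This is an illustration under assumptions rather than a tuned implementation: the class name SmallWriteBuffer and the default 1 MiB capacity are made up, and error handling is reduced to a single sticky flag:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical write-side buffer: collects small writes in a std::vector and
// hands them to fwrite in large chunks.
class SmallWriteBuffer {
public:
    explicit SmallWriteBuffer(std::FILE* out, std::size_t capacity = 1 << 20)
        : out_(out), buf_(capacity), used_(0), ok_(true) {
        assert(out != nullptr && capacity > 0);
    }
    ~SmallWriteBuffer() { flush(); }
    SmallWriteBuffer(const SmallWriteBuffer&) = delete;
    SmallWriteBuffer& operator=(const SmallWriteBuffer&) = delete;

    // Defined in the class body, so implicitly inline: the common case is a
    // bounds check plus a small memcpy.
    void write(const void* data, std::size_t n) {
        if (n == 0) return;
        if (n > buf_.size() - used_) {
            flush();
            if (n > buf_.size()) {  // too large to buffer: write it directly
                ok_ = ok_ && std::fwrite(data, 1, n, out_) == n;
                return;
            }
        }
        std::memcpy(buf_.data() + used_, data, n);
        used_ += n;
    }

    // Pushes buffered bytes to the FILE*. Returns false if any write so far
    // has failed; after a failure, later data is dropped.
    bool flush() {
        if (used_ > 0) {
            ok_ = ok_ && std::fwrite(buf_.data(), 1, used_, out_) == used_;
            used_ = 0;
        }
        return ok_;
    }

private:
    std::FILE* out_;
    std::vector<char> buf_;
    std::size_t used_;  // bytes currently held in buf_
    bool ok_;           // false once any fwrite has come up short
};
```

One such object per output file keeps the writes to each file large and sequential. The object must be flushed or destroyed before the FILE* is closed, otherwise the tail of the data is lost.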
Upvotes: 0
Reputation: 23498
Your problem is not the buffering in fwrite(), but the total overhead of making the library call with small amounts of data. If you write just 1 MB of data 4 bytes at a time, you make about 250,000 function calls. You'd be better off collecting your data in memory and then writing it to disk with a single call to fwrite().
UPDATE: if you need evidence:
$ dd if=/dev/zero of=/dev/null count=50000000 bs=2
50000000+0 records in
50000000+0 records out
100000000 bytes (100 MB) copied, 55.3583 s, 1.8 MB/s
$ dd if=/dev/zero of=/dev/null count=50 bs=2000000
50+0 records in
50+0 records out
100000000 bytes (100 MB) copied, 0.0122651 s, 8.2 GB/s
Upvotes: 5