Reputation: 9691
So I'm reading in relatively large files (>= 1GB) consisting of millions of records, each of which belongs to a particular group. There are 100 groups. To work with the data more effectively, I create 100 files, one per group (using fopen in append mode). As I read the records in from the large file, I write each one to the corresponding smaller file. I keep the file pointers for all 100 files open throughout, so that I am not opening and closing a file with every record.
This takes an incredibly long time, and the speed of the read-in (and write) is not constant. It starts out fast, then will slow to a crawl, then speed up again, then slow. It seems to get worse the more files are read.
One possibility as to what's going on is that as they grow larger, the smaller files need to get relocated in storage. This would be surprising since I have 47GB free (of ~500), but I can't think of anything else. I'll see if defragmenting helps, but in the meantime, does anyone know what's going on and how to fix this? Is there a way to pre-specify the size of a file you want to create, analogous to std::vector::reserve?
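To show the kind of thing I mean: on a POSIX system I picture something like the sketch below, which uses posix_fallocate to reserve space for one group file up front. I haven't verified that this is the right call, or that it even applies to my setup; the file name and size here are made up.

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  /* Illustration only: posix_fallocate is POSIX, not standard C, and the
     file name and size below are invented for the example.               */
  int fd = open ("group00.db", O_WRONLY | O_CREAT, 0644);
  if (fd == -1)
    return 1;

  /* Ask the OS to reserve ~10 MB for this file up front.  Note it returns
     an error number directly rather than setting errno.                   */
  int err = posix_fallocate (fd, 0, 10 * 1024 * 1024);
  if (err != 0)
    fprintf (stderr, "posix_fallocate failed: %d\n", err);

  close (fd);
  return 0;
}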
Upvotes: 2
Views: 312
Reputation: 62797
Having just 100 open files in a process, or 100 files in one directory, should not be a bottleneck on modern systems. But simultaneous random access to 101 files totalling about 2 GB of data can be.
I would do this:
Read some number of records from the large file, storing the records of each group in its own list in memory. Reading about 10 megabytes' worth of records at a time is probably enough to get decent performance, but this depends on available RAM (you do not want to use so much that the OS starts swapping...).
Then go through the 100 record lists in memory, one at a time, and append each to its file. You can keep all the files open; that will probably not be a problem. But you can also try closing and opening them as needed, since that is not much overhead when you work with one file at a time like this.
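A rough, untested sketch of that idea (it borrows the 1 KB record layout from the test programs further down; the batch size and output file names are placeholders you would tune):

#include <stdio.h>
#include <stdlib.h>

#define GROUPS     100
#define BATCH_RECS 10240                /* ~10 MB of 1 KB records per batch */

struct record
{
  unsigned int group;                   /* assumed to be in 0..99 */
  char data [1020];
};

int
main (void)
{
  FILE *in = fopen ("hugefile.db", "rb");
  struct record *batch = malloc (BATCH_RECS * sizeof *batch);
  size_t n;

  if (!in || !batch)
    return 1;

  /* Read a ~10 MB batch, then append it to the group files one file at a
     time.  Re-scanning the in-memory batch once per group is cheap next
     to the disk traffic it replaces.                                      */
  while ((n = fread (batch, sizeof *batch, BATCH_RECS, in)) > 0)
    {
      for (unsigned int g = 0; g < GROUPS; ++g)
        {
          char name [64];
          size_t i;

          sprintf (name, "group%03u.db", g);   /* hypothetical output names */
          FILE *out = fopen (name, "ab");
          if (!out)
            return 1;
          for (i = 0; i < n; ++i)
            if (batch [i].group == g)
              fwrite (&batch [i], sizeof batch [i], 1, out);
          fclose (out);
        }
    }

  free (batch);
  fclose (in);
  return 0;
}

If keeping all 100 files open works for you, you can of course open them once before the loop instead of reopening per batch.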
Upvotes: 1
Reputation: 16888
If you are unable or unwilling to restructure the program to write one group at a time, set bigger buffers for each of the "small" files (with setbuf, setvbuf). The effect of this is that buffer flushes to disk will exhibit more "locality", i.e. instead of flushing X amount of data 100 times to 100 different files, you will be flushing 10X amount of data 10 times to 100 different files.
Test case programs (intentionally without error checking):
-- hugefile.h --
struct record
{
  unsigned int group;
  char data [1020];
};
--- gen-hugefile.c ---
#include <stdio.h>
#include <stdlib.h>
#include "hugefile.h"
int
main (int argc, char **argv)
{
  unsigned int i, nrecords = strtol (argv [1], 0, 10);
  FILE *f;

  f = fopen ("hugefile.db", "w");
  for (i = 0; i < nrecords; ++i)
    {
      struct record r;

      r.group = rand () % 100;
      fwrite (&r, sizeof r, 1, f);
    }
  fclose (f);
  return 0;
}
--- read-hugefile.c ---
#include <stdio.h>
#include <errno.h>
#include <stdlib.h>
#include "hugefile.h"
FILE *in;
FILE *out[100];
int
main ()
{
  int i;
  char name [128];

  in = fopen ("hugefile.db", "r");
#ifdef BUFFER
  /* Give the input stream a larger stdio buffer when BUFFER is defined.  */
  setvbuf (in, malloc (2*BUFFER), _IOFBF, 2*BUFFER);
#endif
  for (i = 0; i < 100; ++i)
    {
      sprintf (name, "out/file%03d.db", i);
      out [i] = fopen (name, "w");
#ifdef BUFFER
      /* Likewise enlarge each output stream's buffer, so flushes are
         rarer and each one writes a bigger contiguous chunk.             */
      setvbuf (out [i], malloc (BUFFER), _IOFBF, BUFFER);
#endif
    }

  /* Copy every record from the big file to its group's output file.  */
  struct record r;
  while ((i = fread (&r, sizeof r, 1, in)) == 1)
    fwrite (&r, sizeof r, 1, out [r.group]);

  fflush (0);
  return 0;
}
velco@sue:~/tmp/hugefile$ ls
gen-hugefile.c hugefile.h read-hugefile.c
velco@sue:~/tmp/hugefile$ gcc -O2 gen-hugefile.c -o gen-hugefile
velco@sue:~/tmp/hugefile$ ./gen-hugefile 1000000
velco@sue:~/tmp/hugefile$ ls -lh
total 978M
-rwxrwxr-x 1 velco velco 8.5K Dec 14 13:33 gen-hugefile
-rw-rw-r-- 1 velco velco 364 Dec 14 13:31 gen-hugefile.c
-rw-rw-r-- 1 velco velco 977M Dec 14 13:34 hugefile.db
-rw-rw-r-- 1 velco velco 61 Dec 14 12:56 hugefile.h
-rw-rw-r-- 1 velco velco 603 Dec 14 13:32 read-hugefile.c
velco@sue:~/tmp/hugefile$ gcc -O2 read-hugefile.c -o read-hugefile
velco@sue:~/tmp/hugefile$ gcc -O2 -DBUFFER=1048576 read-hugefile.c -o read-hugefile-buf
velco@sue:~/tmp/hugefile$ mkdir out
velco@sue:~/tmp/hugefile$ time ./read-hugefile
real 0m34.031s
user 0m0.716s
sys 0m6.204s
velco@sue:~/tmp/hugefile$ time ./read-hugefile
real 0m25.960s
user 0m0.600s
sys 0m6.320s
velco@sue:~/tmp/hugefile$ time ./read-hugefile-buf
real 0m20.756s
user 0m1.528s
sys 0m5.420s
velco@sue:~/tmp/hugefile$ time ./read-hugefile-buf
real 0m16.450s
user 0m1.324s
sys 0m5.012s
velco@sue:~/tmp/hugefile$
Upvotes: 2
Reputation: 941615
You are just seeing the side effect of the file system cache filling up to capacity, then having to wait until space frees up as the data actually gets written to disk, which is glacially slow. While there's space in the cache, the write() call does a memory-to-memory copy and runs at 5 gigabytes per second or better. Disk write speeds are rarely better than 30 megabytes per second. Huge difference, and you are measuring the disk write speed when the cache is full.
You'll need more RAM or a faster disk.
Upvotes: 2
Reputation: 24133
Sounds like you could sort them in memory and write them out one group at a time.
Upvotes: 0