Jeremy Friesner

Reputation: 73209

What's an efficient way to randomize the ordering of the contents of a very large file?

For my neural-network-training project, I've got a very large file of input data. The file format is binary, and it consists of a very large number of fixed-size records. The file is currently ~13GB, but in the future it could become larger; for the purposes of this question let's assume it will be too large to just hold all of it in my computer's RAM at once.

Today's problem involves a little utility program I wrote (in C++, although I think choice of language doesn't matter too much here as one would likely encounter the same problem in any language) that is intended to read the big file and output a similar big file -- the output file is to contain the same data as the input file, except with the records shuffled into a random ordering.

To do this, I mmap() the input file into memory, then generate a list of integers from 1 to N (where N is the number of records in the input file), randomly shuffle the ordering of that list, then iterate over the list, writing out to the output file the n'th record from the mmap'd memory area.
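
In outline, the current approach looks something like this (a simplified sketch rather than my actual code; RECORD_SIZE is a stand-in for my fixed record size, and error checking is omitted):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

constexpr size_t RECORD_SIZE = 2048;   // stand-in for the real fixed record size

void shuffleFile(const char *inPath, const char *outPath)
{
    int inFd = open(inPath, O_RDONLY);
    struct stat st;
    fstat(inFd, &st);
    const size_t numRecords = st.st_size / RECORD_SIZE;

    // Map the whole input file read-only
    const char *base = static_cast<const char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, inFd, 0));

    // Build the index list 0..N-1 and shuffle it
    std::vector<size_t> order(numRecords);
    std::iota(order.begin(), order.end(), 0);
    std::mt19937_64 rng(std::random_device{}());
    std::shuffle(order.begin(), order.end(), rng);

    // Write the records to the output file in shuffled order
    // (each iteration reads from a random spot in the mmap'd area)
    FILE *out = fopen(outPath, "wb");
    for (size_t idx : order)
        fwrite(base + idx * RECORD_SIZE, RECORD_SIZE, 1, out);

    fclose(out);
    munmap(const_cast<char *>(base), st.st_size);
    close(inFd);
}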

This all works correctly, as far as it goes; the problem is that it's not scaling very well; that is, as the input file's size gets bigger, the time it takes to do this conversion is increasing faster than O(N). It's getting to the point where it's become a bottleneck for my workflow. I suspect the problem is that the I/O system (for MacOS/X 10.13.4, using the internal SSD of my Mac Pro trashcan, in case that's important) is optimized for sequential reads, and jumping around to completely random locations in the input file is pretty much the worst-case scenario as far as caching/read-ahead/other I/O optimizations are concerned. (I imagine that on a spinning disk it would perform even worse due to head-seek delays, but fortunately I'm at least using SSD here)

So my question is, is there any clever alternative strategy or optimization I could use to make this file-randomization-process more efficient -- one that would scale better as the size of my input files increases?

Upvotes: 3

Views: 566

Answers (3)

Jeremy Friesner

Reputation: 73209

Thanks to the advice of various people in this thread (in particular Marc Glisse and Andrew Henle), I was able to reduce the execution time of my program on a 13GB input file from ~16 minutes to ~2 minutes. I'll document how I did it in this answer, since the solution wasn't very much like either of the other answers (it was based more on Marc's comment, so I'll give Marc the checkbox if/when he restates his comment as an answer).

I tried replacing the mmap() strategy with pread(), but that didn't seem to make much difference. I also tried passing F_NOCACHE and various other flags to fcntl(), but they seemed to either have no effect or make things slower, so I decided to try a different approach.

The new approach is to do things in a 2-layer fashion: rather than reading in single records at a time, my program now loads in "blocks" of sequential records from the input file (each block containing around 4MB of data).

The blocks are loaded in random order, and I load in blocks until I have a certain amount of block-data held in RAM (currently ~4GB, as that is what my Mac's RAM can comfortably hold). Then I start grabbing random records out of random in-RAM blocks, and writing them to the output file. When a given block no longer has any records left in it to grab, I free that block and load in another block from the input file. I repeat this until all blocks from the input file have been loaded and all their records distributed to the output file.

This is faster because all of my output is strictly sequential, and my input is mostly sequential (i.e. 4MB of data is read after each seek rather than only ~2kB). The ordering of the output is slightly less random than it was, but I don't think that will be a problem for me.
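
In outline, the new 2-layer approach looks something like this (again a simplified sketch rather than my actual code; RECORD_SIZE, BLOCK_BYTES, and RAM_BUDGET are placeholder constants, and error checking is omitted):

#include <algorithm>
#include <cstdio>
#include <cstring>
#include <numeric>
#include <random>
#include <vector>

constexpr size_t RECORD_SIZE = 2048;                        // stand-in for the real fixed record size
constexpr size_t BLOCK_BYTES = 4 * 1024 * 1024;             // ~4MB of records per block
constexpr size_t RAM_BUDGET  = 4ULL * 1024 * 1024 * 1024;   // ~4GB of blocks held in RAM

struct Block { std::vector<char> data; size_t recordsLeft; };

void blockShuffle(const char *inPath, const char *outPath)
{
    FILE *in  = fopen(inPath,  "rb");
    FILE *out = fopen(outPath, "wb");

    fseek(in, 0, SEEK_END);
    const size_t fileSize   = ftell(in);
    const size_t recsPerBlk = BLOCK_BYTES / RECORD_SIZE;
    const size_t numBlocks  = (fileSize / RECORD_SIZE + recsPerBlk - 1) / recsPerBlk;
    const size_t maxLoaded  = RAM_BUDGET / BLOCK_BYTES;

    // Decide up front the (random) order in which blocks will be loaded
    std::vector<size_t> blockOrder(numBlocks);
    std::iota(blockOrder.begin(), blockOrder.end(), 0);
    std::mt19937_64 rng(std::random_device{}());
    std::shuffle(blockOrder.begin(), blockOrder.end(), rng);

    std::vector<Block> loaded;
    size_t nextBlock = 0;

    auto loadOneBlock = [&]() {
        const size_t blk    = blockOrder[nextBlock++];
        const size_t offset = blk * recsPerBlk * RECORD_SIZE;
        const size_t bytes  = std::min(recsPerBlk * RECORD_SIZE, fileSize - offset);
        Block b;
        b.data.resize(bytes);
        fseek(in, offset, SEEK_SET);
        fread(b.data.data(), 1, bytes, in);
        b.recordsLeft = bytes / RECORD_SIZE;
        loaded.push_back(std::move(b));
    };

    while (nextBlock < numBlocks || !loaded.empty())
    {
        // Keep the RAM budget topped up while there are blocks left to load
        while (nextBlock < numBlocks && loaded.size() < maxLoaded) loadOneBlock();

        // Grab a random record out of a random in-RAM block and write it out
        const size_t bi = std::uniform_int_distribution<size_t>(0, loaded.size() - 1)(rng);
        Block &b = loaded[bi];
        const size_t ri = std::uniform_int_distribution<size_t>(0, b.recordsLeft - 1)(rng);
        fwrite(b.data.data() + ri * RECORD_SIZE, RECORD_SIZE, 1, out);

        // Swap-remove: move the last remaining record into the emitted slot
        if (ri != b.recordsLeft - 1)
            std::memcpy(b.data.data() + ri * RECORD_SIZE,
                        b.data.data() + (b.recordsLeft - 1) * RECORD_SIZE,
                        RECORD_SIZE);

        // When a block is exhausted, free it; the outer loop will load another
        if (--b.recordsLeft == 0)
            loaded.erase(loaded.begin() + bi);
    }

    fclose(in);
    fclose(out);
}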

Upvotes: 0

Andrew Henle

Reputation: 1

I'd recommend not using mmap() - there's no way all that memory pressure is any help at all, and unless you're re-reading the same data multiple times, mmap() is often the worst-performing way to read data.

First, generate your N random offsets; then, given those offsets, use pread() to read the data with low-level C-style I/O.

This uses the fcntl() function to disable the page cache for your file. Since you're not re-reading the same data, the page cache likely does you little good, but it does use up RAM, slowing other things down. Try it both with and without the page cache disabled and see which is faster. Note also that I've left out all error checking:

(I've used C-style strings and arrays to match the C-style I/O functions while keeping the code simpler. Note that the POSIX I/O functions such as open(), pread() and fcntl() live in the global namespace, not in namespace std.)

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>

void sendRecords( const char *dataFile, const off_t *offsets, size_t numOffsets )
{
    int fd = open( dataFile, O_RDONLY );
    // try with and without this
    fcntl( fd, F_NOCACHE, 1 );

    // can also try using page-aligned memory here
    // (RECORD_LENGTH is the fixed record size from the question)
    char data[ RECORD_LENGTH ];

    for ( size_t ii = 0; ii < numOffsets; ii++ )
    {
        ssize_t bytesRead = pread( fd, data, sizeof( data ), offsets[ ii ] );
        // process this record
        processRecord( data );
    }

    close( fd );
}

Assuming you have a file containing precalculated random offsets:

#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <fcntl.h>

void sendRecords( const char *dataFile, const char *offsetFile )
{
    int datafd = open( dataFile, O_RDONLY );
    // try with and without this
    fcntl( datafd, F_NOCACHE, 1 );

    int offsetfd = open( offsetFile, O_RDONLY );

    // can also try using page-aligned memory here
    char data[ RECORD_LENGTH ];

    for ( ;; )
    {
        off_t offset;
        ssize_t bytesRead = read( offsetfd, &offset, sizeof( offset ) );
        if ( bytesRead != sizeof( offset ) )
        {
            break;
        }
        bytesRead = pread( datafd, data, sizeof( data ), offset );
        // process this record
        processRecord( data );
    }

    close( datafd );
    close( offsetfd );
}

You can go faster, too, since that code alternates reading and processing, and it'd probably be faster to use multiple threads to read and process simultaneously. It's not that hard to use one or more threads to read data into preallocated buffers that you then queue up and send to your processing thread.
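
A rough sketch of that arrangement, with one reader thread feeding a bounded queue that the processing thread drains (QUEUE_LIMIT, RECORD_LENGTH and processRecord() are placeholders, and error checking is again left out):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#include <unistd.h>
#include <fcntl.h>

void processRecord( const char *data );    // assumed to exist, as in the snippets above

constexpr size_t RECORD_LENGTH = 2048;     // placeholder fixed record size
constexpr size_t QUEUE_LIMIT   = 256;      // cap on records buffered in RAM

void sendRecordsThreaded( const char *dataFile, const off_t *offsets, size_t numOffsets )
{
    std::queue<std::vector<char>> pending;
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;

    // Reader thread: pread() each record and hand it to the processing thread
    std::thread reader( [&]() {
        int fd = open( dataFile, O_RDONLY );
        for ( size_t ii = 0; ii < numOffsets; ii++ )
        {
            std::vector<char> buf( RECORD_LENGTH );
            pread( fd, buf.data(), buf.size(), offsets[ ii ] );

            std::unique_lock<std::mutex> lock( mtx );
            cv.wait( lock, [&] { return pending.size() < QUEUE_LIMIT; } );
            pending.push( std::move( buf ) );
            cv.notify_all();
        }
        close( fd );
        std::lock_guard<std::mutex> lock( mtx );
        done = true;
        cv.notify_all();
    } );

    // Processing side (here: the caller) drains the queue as records arrive
    for ( ;; )
    {
        std::unique_lock<std::mutex> lock( mtx );
        cv.wait( lock, [&] { return !pending.empty() || done; } );
        if ( pending.empty() && done )
        {
            break;
        }
        std::vector<char> buf = std::move( pending.front() );
        pending.pop();
        lock.unlock();
        cv.notify_all();
        processRecord( buf.data() );
    }

    reader.join();
}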

Upvotes: 1

Victor Istomin

Reputation: 1156

If the problem is related to swapping and random disk access while reading random file locations, can you at least read the input file sequentially?

When you access some chunk of the mmap-ed file, the prefetcher will assume that you'll need the adjacent pages soon, so it will load them as well. But you won't need them, so those pages will be discarded and the time spent loading them will be wasted.

  • create an array of N toPositions, so toPosition[i] = i;
  • randomize the destinations (are you using Knuth's shuffle?);
  • then toPosition[i] is the destination of input record i. So read the input data sequentially from the start, and place each record into the corresponding position of the destination file.

Perhaps this will be more prefetcher-friendly. Of course, writing data randomly is slow too, but at least you won't waste prefetched pages from the input file.

An additional benefit is that once you've processed a few million pages of input data, those GBs will be unloaded from RAM and you'll never need them again, so you won't pollute the actual disk cache. Remember that the actual memory page size is at least 4K, so even when you randomly access a single byte of an mmap-ed file, at least 4K of data has to be read from disk into the cache.
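
A minimal sketch of that idea, assuming fixed-size records (RECORD_SIZE and the record count are placeholders, error checking is omitted): the input is read sequentially, and each record is written with pwrite() to its shuffled destination in a pre-sized output file.

#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

constexpr size_t RECORD_SIZE = 2048;   // placeholder fixed record size

void scatterShuffle(const char *inPath, const char *outPath, size_t numRecords)
{
    // toPosition[i] is the destination slot of input record i (Fisher-Yates / Knuth shuffle)
    std::vector<size_t> toPosition(numRecords);
    std::iota(toPosition.begin(), toPosition.end(), 0);
    std::mt19937_64 rng(std::random_device{}());
    std::shuffle(toPosition.begin(), toPosition.end(), rng);

    int inFd  = open(inPath, O_RDONLY);
    int outFd = open(outPath, O_WRONLY | O_CREAT | O_TRUNC, 0644);

    // Pre-size the output file so pwrite() can fill it in any order
    ftruncate(outFd, (off_t)(numRecords * RECORD_SIZE));

    std::vector<char> record(RECORD_SIZE);
    for (size_t i = 0; i < numRecords; i++)
    {
        // sequential read of record i...
        read(inFd, record.data(), RECORD_SIZE);
        // ...scattered write to its random destination
        pwrite(outFd, record.data(), RECORD_SIZE, (off_t)(toPosition[i] * RECORD_SIZE));
    }

    close(inFd);
    close(outFd);
}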

Upvotes: 1
