intrigued_66

Reputation: 17258

Read a file like an array without copying into an array?

I have ASCII files which are 100 to 400 MBs in size.

I'd like to read them byte-by-byte, as if I were reading an array, so I could access each byte with something like if (file[pos] == '\n') etc.

However, I have thousands of these files and I presume it would be expensive to copy each one into an array.

Is it possible to read the files as if they were arrays, without explicitly copying them into an array, while avoiding mmap and using only standard C++?

Upvotes: 0

Views: 116

Answers (1)

Miles Budnek

Reputation: 30619

Sadly, the standard library doesn't provide a standardized way to memory-map files, so you'll need to either use the OS-provided APIs (mmap() on POSIX OSes and CreateFileMapping()/MapViewOfFile() on Windows) or use a library like Boost.Interprocess that wraps the OS APIs to provide a cross-platform interface.

For instance, with Boost you could do something like this:

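#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstddef>

const char* file_name = "my_file.txt";
// read_write is only needed for the write at the end; use read_only otherwise
boost::interprocess::file_mapping mapping(file_name, boost::interprocess::read_write);
boost::interprocess::mapped_region region(mapping, boost::interprocess::read_write);

char* file = static_cast<char*>(region.get_address());
for (std::size_t pos = 0; pos < region.get_size(); ++pos) {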
    if (file[pos] == '\n') {
        // ...
    } else {
        // ...
    }
}
// Or do any sort of random access
file[some_position] = '!';
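
For comparison, here is a minimal sketch of the POSIX route using mmap() directly. It assumes a POSIX system (so it is not standard C++) and read-only access; count_newlines is a hypothetical helper name chosen for illustration, and error handling is kept to the bare minimum.

```cpp
// POSIX only (Linux, macOS, BSD); these headers are not part of standard C++.
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close

#include <cstddef>
#include <stdexcept>

// Hypothetical helper: counts '\n' bytes by indexing a read-only mapping.
std::size_t count_newlines(const char* path)
{
    int fd = ::open(path, O_RDONLY);
    if (fd == -1) throw std::runtime_error("open failed");

    struct stat st;
    if (::fstat(fd, &st) == -1) {
        ::close(fd);
        throw std::runtime_error("fstat failed");
    }
    const std::size_t size = static_cast<std::size_t>(st.st_size);
    if (size == 0) {  // mmap rejects zero-length mappings
        ::close(fd);
        return 0;
    }

    void* addr = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) throw std::runtime_error("mmap failed");

    // The file can now be indexed like an array; pages are loaded on demand.
    const char* file = static_cast<const char*>(addr);
    std::size_t count = 0;
    for (std::size_t pos = 0; pos < size; ++pos) {
        if (file[pos] == '\n') ++count;
    }

    ::munmap(addr, size);
    return count;
}
```

Nothing is copied into a user-side array here: the kernel pages the file in as the loop touches it, which is why mapping tends to suit large files that are scanned once.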

Alternatively, if you really want to use only the standard library, you could fake random access by writing a class that overloads operator[] to read a chunk of the file into memory on demand. Something like this:

#include <cstddef>
#include <filesystem>
#include <fstream>
#include <vector>

class FakeMappedFile
{
private:
    std::fstream stream_;
    std::vector<char> buffer_;

    static const std::size_t BUF_SIZE = 1024;

    void read_chunk(std::size_t pos)
    {
        buffer_.resize(BUF_SIZE);
        stream_.seekg(pos);
        stream_.read(buffer_.data(), BUF_SIZE);
        if (!stream_) {
            buffer_.resize(stream_.gcount());
            stream_.clear();
        }
        stream_.seekg(pos);
    }

public:
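    // Binary mode so that seek offsets correspond to byte positions on every platform
    explicit FakeMappedFile(const std::filesystem::path& file_name)
        : stream_{file_name, std::ios::in | std::ios::out | std::ios::binary}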
    {
        read_chunk(0);
    }

    ~FakeMappedFile()
    {
        flush();
    }

    std::size_t size()
    {
        auto saved_pos = stream_.tellg();
        stream_.seekg(0, std::ios::end);
        auto end_pos = stream_.tellg();
        stream_.seekg(saved_pos);
        return end_pos;
    }

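    void flush()
    {
        // The stream position always marks the start of the buffered chunk
        auto pos = stream_.tellg();
        stream_.seekp(pos);
        stream_.write(buffer_.data(), buffer_.size());
        // Writing advances the shared file position, so restore it;
        // otherwise operator[] miscomputes offsets after a manual flush()
        stream_.seekg(pos);
    }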

    char& operator[](std::size_t pos)
    {
        if (
            std::size_t current_pos = stream_.tellg();
            pos < current_pos || pos >= current_pos + BUF_SIZE
        ) {
            flush();
            read_chunk(pos);
        }
        return buffer_[pos - stream_.tellg()];
    }
};


This provides fake random access to a file by reading it in 1 KB chunks and writing the data back to the file every time it reads a new chunk. It is fairly inefficient, since it flushes the data back to disk even if you didn't modify it, but it shows the idea. You could use a wrapper class around the returned char& to detect writes and only write the data back to disk if it's been changed, or simply remove the flush function entirely if you don't need to be able to modify the file.
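
The wrapper idea could look roughly like the sketch below. It is one possible shape rather than a tested design: ChunkedFile, CharRef, chunk_start_ and dirty_ are hypothetical names, and it assumes every index passed to operator[] is smaller than the file size.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical variant of the class above that only writes back modified chunks.
class ChunkedFile
{
    std::fstream stream_;
    std::vector<char> buffer_;
    std::size_t chunk_start_ = 0;  // file offset of buffer_[0]
    bool dirty_ = false;           // set when a write goes through CharRef

    static constexpr std::size_t BUF_SIZE = 1024;

    void load(std::size_t pos)
    {
        flush();  // no-op unless the current chunk was modified
        buffer_.resize(BUF_SIZE);
        stream_.seekg(static_cast<std::streamoff>(pos));
        stream_.read(buffer_.data(), BUF_SIZE);
        buffer_.resize(static_cast<std::size_t>(stream_.gcount()));
        stream_.clear();  // a short read at end of file sets eofbit/failbit
        chunk_start_ = pos;
    }

public:
    // Proxy returned by operator[]: reads convert to char, writes set dirty_.
    class CharRef
    {
        ChunkedFile& file_;
        std::size_t index_;  // index into buffer_, not into the file
    public:
        CharRef(ChunkedFile& file, std::size_t index) : file_(file), index_(index) {}
        operator char() const { return file_.buffer_[index_]; }
        CharRef& operator=(char c)
        {
            file_.buffer_[index_] = c;
            file_.dirty_ = true;
            return *this;
        }
    };

    explicit ChunkedFile(const std::string& name)
        : stream_(name, std::ios::in | std::ios::out | std::ios::binary)
    {
        load(0);
    }

    ~ChunkedFile() { flush(); }

    void flush()
    {
        if (!dirty_) return;  // untouched chunks never hit the disk
        stream_.seekp(static_cast<std::streamoff>(chunk_start_));
        stream_.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
        stream_.flush();
        dirty_ = false;
    }

    CharRef operator[](std::size_t pos)
    {
        if (pos < chunk_start_ || pos >= chunk_start_ + buffer_.size()) load(pos);
        assert(pos - chunk_start_ < buffer_.size());  // pos must be inside the file
        return CharRef(*this, pos - chunk_start_);
    }
};
```

Reading looks like char c = file[pos]; and writing like file[pos] = '!';. A CharRef should be used immediately: it refers to a slot in the current chunk, so it goes stale as soon as another operator[] call loads a different chunk.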

Upvotes: 2
