Reputation: 17258
I have ASCII files which are 100 to 400 MBs in size.
I'd like to read them byte-by-byte, as if I were reading an array, so I could access each byte like if (file[pos] == '\n'), etc.
However, I have thousands of these files and I presume it would be expensive to copy each one into an array.
Is it possible to read the files as if they were arrays, without explicitly copying them into an array, while avoiding mmap and using only standard C++?
Upvotes: 0
Views: 116
Reputation: 30619
Sadly, the standard library doesn't provide a standardized way to memory-map files, so you'll need to either use the OS-provided APIs (mmap() on POSIX systems, CreateFileMapping()/MapViewOfFile() on Windows) or use a library like Boost.Interprocess that wraps the OS APIs in a cross-platform interface.
For instance, with Boost you could do something like this:
const char* file_name = "my_file.txt";

// Map the whole file; read_write lets us both read and modify it in place
boost::interprocess::file_mapping mapping(file_name, boost::interprocess::read_write);
boost::interprocess::mapped_region region(mapping, boost::interprocess::read_write);

char* file = static_cast<char*>(region.get_address());
for (std::size_t pos = 0; pos < region.get_size(); ++pos) {
    if (file[pos] == '\n') {
        // ...
    } else {
        // ...
    }
}

// Or do any sort of random access
file[some_position] = '!';
Alternatively, if you really want to use only the standard library, you could fake random access by writing a class that overloads operator[] to read a chunk of the file into memory. Something like this:
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <vector>

class FakeMappedFile
{
private:
    std::fstream stream_;
    std::vector<char> buffer_;  // the currently loaded chunk
    static constexpr std::size_t BUF_SIZE = 1024;

    // Loads the chunk starting at pos and leaves the stream positioned
    // at the start of that chunk (the class invariant).
    void read_chunk(std::size_t pos)
    {
        buffer_.resize(BUF_SIZE);
        stream_.seekg(pos);
        stream_.read(buffer_.data(), BUF_SIZE);
        if (!stream_) {
            // Short read at end of file: shrink to what was actually read
            buffer_.resize(stream_.gcount());
            stream_.clear();
        }
        stream_.seekg(pos);
    }

public:
    explicit FakeMappedFile(const std::filesystem::path& file_name)
        : stream_{file_name, std::ios::in | std::ios::out | std::ios::binary}
    {
        read_chunk(0);
    }

    ~FakeMappedFile()
    {
        flush();
    }

    std::size_t size()
    {
        auto saved_pos = stream_.tellg();
        stream_.seekg(0, std::ios::end);
        auto end_pos = stream_.tellg();
        stream_.seekg(saved_pos);
        return static_cast<std::size_t>(end_pos);
    }

    // Writes the current chunk back to the file at the chunk's position
    void flush()
    {
        auto pos = stream_.tellg();
        stream_.seekp(pos);
        stream_.write(buffer_.data(), buffer_.size());
        stream_.seekg(pos);
    }

    // Assumes pos < size()
    char& operator[](std::size_t pos)
    {
        if (
            std::size_t chunk_start = stream_.tellg();
            pos < chunk_start || pos >= chunk_start + buffer_.size()
        ) {
            flush();
            read_chunk(pos);
        }
        return buffer_[pos - static_cast<std::size_t>(stream_.tellg())];
    }
};
This provides fake random access to a file by reading it in 1 KB chunks and writing the current chunk back to the file any time a new chunk is read.
It is fairly inefficient, since it flushes the data back to disk even if you didn't modify it, but it shows the idea. You could use a wrapper class around the returned char& to detect writes and only write the data back to disk if it has been changed, or simply remove the flush function entirely if you don't need to modify the file.
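A minimal sketch of such a write-detecting wrapper, to make the last suggestion concrete. The name DirtyCharRef is illustrative, not from any library; operator[] would return one of these by value instead of a char&, and flush() would then write only when the dirty flag is set:

```cpp
// Hypothetical proxy returned by operator[] instead of a raw char&.
// Reads go through the implicit conversion; writes set a dirty flag
// so flush() can skip the disk write when nothing changed.
class DirtyCharRef
{
public:
    DirtyCharRef(char& c, bool& dirty) : c_{c}, dirty_{dirty} {}

    // Reading leaves the flag alone
    operator char() const { return c_; }

    // Writing updates the buffer and marks it as needing a flush
    DirtyCharRef& operator=(char value)
    {
        c_ = value;
        dirty_ = true;
        return *this;
    }

private:
    char& c_;
    bool& dirty_;
};
```

The cost of this approach is that the proxy is not a real char& (e.g. you can't take its address), which is the usual trade-off with proxy references such as the one std::vector<bool> returns.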
Upvotes: 2