pistal

Reputation: 2456

Handling large gzfile in c++

char buffer[1001];
for (; !gzeof(m_fHandle); ) {
    gzread(m_fHandle, buffer, 1000);
    // ...
}

The file I'm handling is more than 1GB.

Do I load the entire file into the buffer, or should I malloc a buffer of the right size?

Or should I load it line by line? The file has a "\n" demarcating the EOL. If so, how do I do that for a gzfile in C++?

Upvotes: 2

Views: 6372

Answers (2)

sehe

Reputation: 393084

The zlib approach would be:

You can just call gzread with a limited buffer size repeatedly. If you can be sure that the max line length is, e.g., BUFLEN: See it Live On Coliru

#include <zlib.h>
#include <iostream>
#include <algorithm>

static const unsigned BUFLEN = 1024;

void error(const char* const msg)
{
    std::cerr << msg << "\n";
    exit(255);
}

void process(gzFile in)
{
    char buf[BUFLEN];
    char* offset = buf;

    for (;;) {
        int err, len = sizeof(buf)-(offset-buf);
        if (len == 0) error("Buffer too small for input line lengths");

        len = gzread(in, offset, len);

        if (len == 0) break;    
        if (len <  0) error(gzerror(in, &err));

        char* cur = buf;
        char* end = offset+len;

        for (char* eol; (cur<end) && (eol = std::find(cur, end, '\n')) < end; cur = eol + 1)
        {
            std::cout << std::string(cur, eol) << "\n";
        }

        // any trailing data in [eol, end) now is a partial line
        offset = std::copy(cur, end, buf);
    }

    // BIG CATCH: don't forget about trailing data without eol :)
    std::cout << std::string(buf, offset);

    if (gzclose(in) != Z_OK) error("failed gzclose");
}

int main()
{
    process(gzopen("test.gz", "rb"));
}

If you cannot know the maximum line size, I'd suggest abstracting it a bit more and deriving from std::basic_streambuf, overriding underflow, so you can use std::getline with an istream based on this buffer.

UPDATE Since you're new to C++, implementing your own streambuf is likely not a good idea. I recommend using a C++ library (instead of zlib).

E.g. Boost Iostream allows you to simply do this:

Live On Coliru

#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <string>

namespace io = boost::iostreams;

int main()
{   
    io::filtering_istream in;
    in.push(io::gzip_decompressor());
    in.push(io::file_source("my_file.txt"));
    // read from in using std::istream interface

    std::string line;
    while (std::getline(in, line, '\n'))
    {
         process(line); // your code :)
    }
}

Upvotes: 4

Lother

Reputation: 1797

You say this is a gzfile. That implies a binary format, where '\n' is not necessarily valid as an EOL marker (there is no real concept of EOL in binary files).

That said, in practice you have a couple of choices for buffer size. Loading the entire file into memory will certainly make the data easier for you, the developer, to work with. However, this is a costly solution in terms of the memory consumed for the task.

If memory is a concern then you need to work on the data in pieces. There is probably an optimal amount of data to try to fetch at a time and a lot of that will depend on the hardware architecture of the machine you have all the way from the CPU through cache lines, memory bus, SATA bus, and even the drives that hold the file itself.

If this is just a onesy-twosy kind of problem you're solving and you're running it on a modern computer, 1GB is probably OK to keep in memory. Just new a uint8_t[] the size of the file, read the whole thing in, and then parse the data.

Otherwise, you need to integrate your parsing of the file with the reading of the file.

Upvotes: 0
