Reputation: 2456
char buffer[1001];
for (; !gzeof(m_fHandle);) {
    gzread(m_fHandle, buffer, 1000);
    // ... process the bytes just read ...
}
The file I'm handling is more than 1GB.
Do I load the entire file into the buffer, or should I malloc and allocate the full size?
Or should I load it line by line? The file uses "\n" to demarcate the end of each line. If so, how do I do that for a gzFile in C++?
Upvotes: 2
Views: 6372
Reputation: 393084
The zlib approach would be: you can just call gzread repeatedly with a limited buffer size. If you can be sure that the maximum line length is at most, e.g., BUFLEN:
#include <zlib.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>

static const unsigned BUFLEN = 1024;

void error(const char* const msg)
{
    std::cerr << msg << "\n";
    std::exit(255);
}

void process(gzFile in)
{
    char buf[BUFLEN];
    char* offset = buf; // end of the partial line carried over from the previous read

    for (;;) {
        int err, len = sizeof(buf) - (offset - buf);
        if (len == 0) error("Buffer too small for input line lengths");

        len = gzread(in, offset, len);

        if (len == 0) break;
        if (len < 0) error(gzerror(in, &err));

        char* cur = buf;
        char* end = offset + len;

        // print every complete line currently in the buffer
        for (char* eol; (cur < end) && (eol = std::find(cur, end, '\n')) < end; cur = eol + 1)
        {
            std::cout << std::string(cur, eol) << "\n";
        }

        // any trailing data in [cur, end) now is a partial line; move it to the front
        offset = std::copy(cur, end, buf);
    }

    // BIG CATCH: don't forget about trailing data without eol :)
    std::cout << std::string(buf, offset);

    if (gzclose(in) != Z_OK) error("failed gzclose");
}

int main()
{
    gzFile in = gzopen("test.gz", "rb");
    if (!in) error("failed gzopen"); // gzopen returns NULL on failure
    process(in);
}
If you cannot know the maximum line size, I'd suggest abstracting it a bit more: derive from std::basic_streambuf and override underflow, so you can use std::getline with an istream based on this buffer.
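As a rough illustration of that idea, here is a minimal sketch of such a stream buffer. The class name callback_streambuf and the callback-based design are my own assumptions, not part of zlib or of the answer above: the buffer pulls bytes from any function with gzread-like semantics (returns the number of bytes read, 0 at end of data, negative on error), which keeps the sketch independent of zlib. It also assumes the buffer size is greater than zero.

```cpp
#include <cstddef>
#include <functional>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: a read-only streambuf that refills itself from a callback.
// The callback fills up to `n` bytes at `p` and returns how many it wrote
// (0 at end of data, negative on error), like gzread does.
class callback_streambuf : public std::streambuf {
public:
    using reader = std::function<int(char* p, unsigned n)>;

    explicit callback_streambuf(reader read, std::size_t bufsize = 4096)
        : read_(std::move(read)), buf_(bufsize) {}

protected:
    // Called by the istream whenever the get area is exhausted.
    int_type underflow() override {
        if (gptr() < egptr())
            return traits_type::to_int_type(*gptr());

        const int n = read_(buf_.data(), static_cast<unsigned>(buf_.size()));
        if (n <= 0)
            return traits_type::eof(); // end of data or read error

        setg(buf_.data(), buf_.data(), buf_.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    reader read_;
    std::vector<char> buf_;
};

// Possible use with zlib (assumes `f` is an open gzFile):
//   callback_streambuf sb([f](char* p, unsigned n) { return gzread(f, p, n); });
//   std::istream in(&sb);
//   for (std::string line; std::getline(in, line);) { /* handle line */ }
```

Because std::getline grows the std::string as needed, line length is no longer limited by the buffer size. Note that this sketch treats a read error the same as end of file; real code would probably want to check gzerror after the loop.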
UPDATE: Since you're new to C++, implementing your own streambuf is likely not a good idea. I recommend using a C++ library (instead of using zlib directly).
E.g. Boost.Iostreams allows you to simply do this:
#include <boost/iostreams/device/file.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/gzip.hpp>
#include <ios>
#include <string>

namespace io = boost::iostreams;

int main()
{
    io::filtering_istream in;
    in.push(io::gzip_decompressor());
    // open in binary mode so the compressed bytes are not altered on the way in
    in.push(io::file_source("my_file.txt", std::ios_base::in | std::ios_base::binary));

    // read from in using std::istream interface
    std::string line;
    while (std::getline(in, line, '\n'))
    {
        process(line); // your code :)
    }
}
Upvotes: 4
Reputation: 1797
You say this is a gzfile. That implies a binary format where '\n' is not valid for EOL (there is no concept of EOL with binary files.)
That said, in practice you have a couple of choices for buffer size. Loading the entire file into memory will certainly make the data easier for you as a developer to work with. However, this is a costly solution in terms of memory consumed for the task.
If memory is a concern, then you need to work on the data in pieces. There is probably an optimal amount of data to fetch at a time, and a lot of that will depend on the hardware architecture of your machine, all the way from the CPU through cache lines, memory bus, and SATA bus to the drives that hold the file itself.
If this is just a onesy-twosy kind of problem you're solving and you're running this on a modern computer, 1GB is probably OK to keep in memory. Just new a uint8_t[] the size of the file, read the whole thing in, and then parse the data.
Otherwise, you need to integrate your parsing of the file with the reading of the file.
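One way to picture "integrating parsing with reading" is a small object that accepts whatever chunk the last read produced and hands back complete lines, keeping any unfinished tail for the next chunk. The sketch below is only an illustration under that assumption; line_splitter, feed and finish are hypothetical names, and the chunks could come from gzread or from any other source.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

// Hypothetical helper: feed it arbitrary chunks; it calls `on_line` once per
// complete line and keeps the unfinished tail until more data arrives.
class line_splitter {
public:
    explicit line_splitter(std::function<void(const std::string&)> on_line)
        : on_line_(std::move(on_line)) {}

    // Consume one chunk, e.g. the bytes returned by a single read call.
    void feed(const char* data, std::size_t len) {
        pending_.append(data, len);
        std::size_t start = 0;
        for (std::size_t eol;
             (eol = pending_.find('\n', start)) != std::string::npos;
             start = eol + 1) {
            on_line_(pending_.substr(start, eol - start));
        }
        pending_.erase(0, start); // keep only the partial last line
    }

    // Call once after the last chunk to emit a final line without '\n'.
    void finish() {
        if (!pending_.empty()) {
            on_line_(pending_);
            pending_.clear();
        }
    }

private:
    std::function<void(const std::string&)> on_line_;
    std::string pending_;
};
```

With this shape, memory use is bounded by the chunk size plus the longest line rather than by the file size. A read loop would call feed after every successful read and finish once at end of file.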
Upvotes: 0