Reputation: 6948
I have some .gz compressed files which is around 5-7gig uncompressed. These are flatfiles.
I've written a program that takes a uncompressed file, and reads it line per line, which works perfectly.
Now I want to be able to open the compressed files inmemory and run my little program.
I've looked into zlib but I can't find a good solution.
Loading the entire file is impossible using gzread(gzFile,void *,unsigned), because of the 32bit unsigned int limitation.
I've tried gzgets, but this almost doubles the execution time, vs reading in using gzread.(I tested on a 2gig sample.)
I've also looked into "buffering", such as splitting the gzread process into multiple 2gig chunks, find the last newline using strcchr, and then setting the gzseek. But gzseek will emulate a total file uncompression. which is very slow.
I fail to see any sane solution to this problem. I could always do some checking, whether or not a current line actually has a newline (should only occure in the last partially read line), and then read more data from the point in the program where this occurs. But this could get very ugly.
Does anyhow have any suggestions?
thanks
edit: I dont need to have the entire file at once,just need one line a time, but I got a fairly huge machine, so if that was the easiest I would have no problems.
For all those that suggest piping the stdin, I've experienced extreme slowdowns compared to opening the file. Here is a small code snippet I made some months ago, that illustrates it.
time ./a.out 59846/59846.txt
# 59846/59846.txt
18255221
real 0m4.321s
user 0m2.884s
sys 0m1.424s
time ./a.out <59846/59846.txt
18255221
real 1m56.544s
user 1m55.043s
sys 0m1.512s
And the source code
#include <iostream>
#include <fstream>
#define LENS 10000
int main(int argc, char **argv){
std::istream *pFile;
if(argc==2)//ifargument supplied
pFile = new std::ifstream(argv[1],std::ios::in);
else //if we want to use stdin
pFile = &std::cin;
char line[LENS];
if(argc==2) //if we are using a filename, print it.
printf("#\t%s\n",argv[1]);
if(!pFile){
printf("Do you have permission to open file?\n");
return 0;
}
int numRow=0;
while(!pFile->eof()) {
numRow++;
pFile->getline(line,LENS);
}
if(argc==2)
delete pFile;
printf("%d\n",numRow);
return 0;
}
thanks for your replies, I'm still waiting the golden apple
edit2: using the cstyle FILE pointers instead of c++ streams is much much faster. So I think this is the way to go.
Thank for all your input
Upvotes: 1
Views: 4891
Reputation: 47640
gzip -cd compressed.gz | yourprogram
just go ahead and read it line by line from stdin as it is uncompressed.
EDIT: Response to your remarks about performance. You're saying reading STDIN line by line is slow compared to reading an uncompressed file directly. The difference lies within terms of buffering. Normally pipe will yield to STDIN as soon as the output becomes available (no, or very small buffering there). You can do "buffered block reads" from STDIN and parse the read blocks yourself to gain performance.
You can achieve the same result with possibly better performance by using gzread()
as well. (Read a big chunk, parse the chunk, read the next chunk, repeat)
Upvotes: 5
Reputation: 15118
I don't know if this will be an answer to your question, but I believe it's more than a comment:
Some months ago I discovered that the contents of Wikipedia can be downloaded in much the same way as the StackOverflow data dump. Both decompress to XML.
I came across a description of how the multi-gigabyte compressed dump file could be parsed. It was done by Perl scripts, actually, but the relevant part for you was that Bzip2 compression was used.
Bzip2 is a block compression scheme, and the compressed file could be split into manageable pieces, and each part uncompressed individually.
Unfortunately, I don't have a link to share with you, and I can't suggest how you would search for it, except to say that it was described on a Wikipedia 'data dump' or 'blog' page.
EDIT: Actually, I do have a link
Upvotes: 0
Reputation: 229088
gzread only reads chunks of the file, you loop on it as you would using a normal read() call.
Do you need to read the entire file into memory ?
If what you need is to read lines, you'd gzread() a sizable chunk(say 8192 bytes) into a buffer, loop through that buffer and find all '\n' characters and process those as individual lines. You'd have to save the last piece incase there is just part of a line, and prepend that to the data you read next time.
You could also read from stdin and invoke your app like
zcat bigfile.gz | ./yourprogram
in which case you can use fgets and similar on stdin. This is also beneficial in that you'd run decompression on one processor and processing the data on another processor :-)
Upvotes: 5