Reputation: 13896
I had the idea of merging an arbitrary number of small text files into a single compressed file using the GZipStream class. I spent several nights getting it to work, but the resulting file ends up being larger than the text files would be if they were simply concatenated. I only vaguely understand how Huffman coding works, so I don't know whether this approach is practical or whether there is a better alternative. Ultimately, I want an external sorted index file that maps out each blob for fast access. What do you think?
using System.IO;
using System.IO.Compression;
using System.Text;

// keep track of the current byte offset within the merged file
long indexByteOffset = 0;
// in reality the blobs vary in size from 1k to 300k bytes
string[] originalData = { "data blob1", "data blob2", "data blob3", "data blob4" /* etc etc etc */ };
// merged compressed file
using (BinaryWriter zipWriter = new BinaryWriter(File.Create(@"c:\temp\merged.gz")))
// index keeps track of the beginning and end position of each blob
using (StreamWriter indexWriter = new StreamWriter(File.Create(@"c:\temp\index.txt")))
{
    foreach (var blob in originalData)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            // compress each blob independently; disposing the GZipStream flushes it
            using (GZipStream zipper = new GZipStream(ms, CompressionMode.Compress))
            {
                byte[] encodeBuffer = Encoding.UTF8.GetBytes(blob);
                zipper.Write(encodeBuffer, 0, encodeBuffer.Length);
            }
            byte[] compressedData = ms.ToArray();
            zipWriter.Write(compressedData);
            // "\t" must be a string here; adding the char '\t' to a long
            // would add 9 to the offset instead of inserting a tab
            indexWriter.WriteLine(indexByteOffset + "\t" + (indexByteOffset + compressedData.Length));
            indexByteOffset += compressedData.Length;
        }
    }
}
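Reading a blob back via the index would look roughly like this (a sketch only, assuming the tab-separated start/end offsets written to index.txt above):

// Sketch: read one blob back using a start/end offset pair from index.txt.
// Each blob was gzipped independently, so it can be decompressed on its own.
string ReadBlob(string mergedPath, long start, long end)
{
    using (FileStream fs = File.OpenRead(mergedPath))
    using (BinaryReader reader = new BinaryReader(fs))
    {
        fs.Seek(start, SeekOrigin.Begin);
        byte[] compressed = reader.ReadBytes((int)(end - start));
        using (MemoryStream src = new MemoryStream(compressed))
        using (GZipStream zipper = new GZipStream(src, CompressionMode.Decompress))
        using (MemoryStream dst = new MemoryStream())
        {
            zipper.CopyTo(dst);
            return Encoding.UTF8.GetString(dst.ToArray());
        }
    }
}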
Upvotes: 0
Views: 227
Reputation: 1062600
Different data compresses with different effectiveness, and small data usually isn't worth trying to compress. One common approach is to allow for an "is it compressed?" flag: do a speculative compress, but if the result is larger, store the original. That information could be included in the index.

Personally, though, I'd probably be tempted to go for a single file - either a .zip, or one that includes the length of each fragment as a 4-byte prefix (or maybe a "varint") before each fragment. Seeking to the n-th fragment is then just a case of "read length prefix, decode as int, seek that many bytes, repeat". You could also reserve one bit of that prefix for "is it compressed".

But as for whether it is worth compressing at all: that depends on your data.
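A minimal sketch of that single-file, length-prefixed layout. The exact format here is just an illustrative assumption: a 4-byte length written by BinaryWriter, with the sign bit doing duty as the "is it compressed" flag (so it assumes non-empty fragments).

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

static class FragmentFile
{
    // Writes each blob as: [int32 length][payload]; a negative length marks
    // "payload is gzip-compressed". Layout is an assumption, not a standard.
    public static void Write(string path, IEnumerable<string> blobs)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            foreach (string blob in blobs)
            {
                byte[] raw = Encoding.UTF8.GetBytes(blob);
                byte[] compressed;
                using (var ms = new MemoryStream())
                {
                    using (var gz = new GZipStream(ms, CompressionMode.Compress))
                        gz.Write(raw, 0, raw.Length);
                    compressed = ms.ToArray();
                }

                // speculative compress: keep the original if gzip made it bigger
                bool useCompressed = compressed.Length < raw.Length;
                byte[] payload = useCompressed ? compressed : raw;

                int lengthWithFlag = useCompressed ? -payload.Length : payload.Length;
                writer.Write(lengthWithFlag);   // 4-byte prefix, sign = "is compressed"
                writer.Write(payload);
            }
        }
    }

    // Reads the n-th fragment: read length prefix, skip that many bytes, repeat.
    public static string ReadFragment(string path, int index)
    {
        using (var reader = new BinaryReader(File.OpenRead(path)))
        {
            for (int i = 0; ; i++)
            {
                int lengthWithFlag = reader.ReadInt32();
                bool isCompressed = lengthWithFlag < 0;
                int length = Math.Abs(lengthWithFlag);

                if (i < index)
                {
                    // not the fragment we want: skip over its payload
                    reader.BaseStream.Seek(length, SeekOrigin.Current);
                    continue;
                }

                byte[] payload = reader.ReadBytes(length);
                if (!isCompressed)
                    return Encoding.UTF8.GetString(payload);

                using (var src = new MemoryStream(payload))
                using (var gz = new GZipStream(src, CompressionMode.Decompress))
                using (var dst = new MemoryStream())
                {
                    gz.CopyTo(dst);
                    return Encoding.UTF8.GetString(dst.ToArray());
                }
            }
        }
    }
}

Using the sign bit keeps the prefix at a fixed 4 bytes; a varint would save a little space for small fragments at the cost of slightly more parsing code.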
Upvotes: 1