Richard
Richard

Reputation: 61259

Read and write array to compressed file with boost iostreams

I want to write an array to a file, compressing it as I go.

Later, I want to read the array from that file, decompressing it as I go.

Boost's Iostreams seems like a good way to go, so I built the following code. Unfortunately, the output and input data do not compare equal at the end. But they very nearly do:

Output         Input
0.8401877284   0.8401880264
0.3943829238   0.3943830132
0.7830992341   0.7830989957
0.7984400392   0.7984399796
0.9116473794   0.9116470218
0.1975513697   0.1975509971
0.3352227509   0.3352229893

This suggests that the least significant byte of each float is getting changed, or something. The compression should be lossless, though, so this is not expected or desired. What gives?

//Compile with: g++ test.cpp --std=c++11 -lz -lboost_iostreams
#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <cstdlib>
#include <vector>
#include <iomanip>

int main() 
{
    using namespace std;
    using namespace boost::iostreams;

    const int NUM = 10000;

    std::vector<float> data_out;
    std::vector<float> data_in;
    data_in.resize(NUM);
    for(float i=0;i<NUM;i++)
      data_out.push_back(rand()/(float)RAND_MAX);

    {
      ofstream file("/z/hello.z", ios_base::out | ios_base::binary);
      filtering_ostream out;
      out.push(zlib_compressor());
      out.push(file);

      for(const auto d: data_out)
        out<<d;
    }

    {
      ifstream file_in("hello.z", ios_base::in | ios_base::binary);
      filtering_istream in;
      in.push(zlib_decompressor());
      in.push(file_in);

      for(float i=0;i<NUM;i++)
        in>>data_in[i];
    }

    bool all_good=true;
    for(int i=0;i<NUM;i++){
      cout<<std::setprecision(10)<<data_out[i]<<"   "<<data_in[i]<<endl;
      all_good &= (data_out[i]==data_in[i]);
    }

    cout<<"Good? "<<(int)all_good<<endl;
}

And, yes, I very much prefer to use the stream operators in the way I do, rather than pushing or pulling an entire vector block at once.

Upvotes: 5

Views: 4328

Answers (2)

Richard
Richard

Reputation: 61259

As Dan Mašek pointed out in their answer, the << stream operator I was using was converting my floating-point data into a textual representation prior to compression. For some reason, I hadn't expected this.

Using the serialization library is one way to avoid this, but would introduce additional dependencies in addition to possible overhead.

Therefore, I have used a reinterpret_cast on the floating-point data and the ostream::write() method to write the data without conversion one character at a time. Reading uses a similar method. Efficiencies could be improved by increasing the number of characters written at a time.

#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <cstdlib>
#include <vector>
#include <iomanip>

int main() 
{
    using namespace std;
    using namespace boost::iostreams;

    const int NUM = 10000;

    std::vector<float> data_out;
    std::vector<float> data_in;
    data_in.resize(NUM);
    for(float i=0;i<NUM;i++)
      data_out.push_back(233*(rand()/(float)RAND_MAX));

    {
      ofstream file("/z/hello.z", ios_base::out | ios_base::binary);
      filtering_ostream out;
      out.push(zlib_compressor());
      out.push(file);

      char *dptr = reinterpret_cast<char*>(data_out.data());

      for(int i=0;i<sizeof(float)*NUM;i++)
        out.write(&dptr[i],1);
    }

    {
      ifstream file_in("hello.z", ios_base::in | ios_base::binary);
      filtering_istream in;
      in.push(zlib_decompressor());
      in.push(file_in);

      char *dptr = reinterpret_cast<char*>(data_in.data());

      for(int i=0;i<sizeof(float)*NUM;i++)
        in.read(&dptr[i],1);
    }

    bool all_good=true;
    for(int i=0;i<NUM;i++){
      cout<<std::setprecision(10)<<data_out[i]<<"   "<<data_in[i]<<endl;
      all_good &= (data_out[i]==data_in[i]);
    }

    cout<<"Good? "<<(int)all_good<<endl;
}

Upvotes: 1

Dan Mašek
Dan Mašek

Reputation: 19041

The problem is not with compression, but in the way you serialize the values of the vector.

If you disable the compression and limit the size to 10 elements for easier inspection, you can see that the produced file looks something like this:

0.001251260.5635850.1933040.808740.5850090.4798730.3502910.8959620.822840.746605

As you can see, the numbers are represented as text, with a limited number of decimal places, and no separator. It is by sheer chance (since you're only working with values < 1.0) that your program was able to produce a remotely sensible result.

This happens since you use the stream operator << which formats numeric types as text.


The simplest solution would seem to be using boost::serialization to handle the reading and writing (and use boost::iostreams as the underlying compressed stream). I used a binary archive, but you could concievably use text archive as well (just replace the binary_ with text_).

Sample code:

#include <fstream>
#include <iostream>
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/zlib.hpp>

#include <boost/archive/binary_oarchive.hpp>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/serialization/vector.hpp>

#include <cstdlib>
#include <vector>
#include <iomanip>

int main() 
{
    using namespace std;
    using namespace boost::iostreams;

    const int NUM = 10;

    std::vector<float> data_out;
    for (float i = 0; i < NUM; i++) {
        data_out.push_back(rand() / (float)RAND_MAX);
    }

    {
        ofstream file("hello.z", ios_base::out | ios_base::binary);
        filtering_ostream out;
        out.push(zlib_compressor());
        out.push(file);

        boost::archive::binary_oarchive oa(out);
        oa & data_out;
    }

    std::vector<float> data_in;
    {
        ifstream file_in("hello.z", ios_base::in | ios_base::binary);
        filtering_istream in;
        in.push(zlib_decompressor());
        in.push(file_in);

        boost::archive::binary_iarchive ia(in);
        ia & data_in;
    }

    bool all_good=true;
    for(int i=0;i<NUM;i++){
      cout<<std::setprecision(10)<<data_out[i]<<"   "<<data_in[i]<<endl;
      all_good &= (data_out[i]==data_in[i]);
    }

    cout<<"Good? "<<(int)all_good<<endl;
}

Console Output:

0.001251258887   0.001251258887
0.563585341   0.563585341
0.1933042407   0.1933042407
0.8087404966   0.8087404966
0.5850093365   0.5850093365
0.4798730314   0.4798730314
0.3502914608   0.3502914608
0.8959624171   0.8959624171
0.822840035   0.822840035
0.7466048002   0.7466048002
Good? 1

A minor problem is that you do not serialize the size of the vector, so when reading you have to keep reading until the end of the stream.

Upvotes: 7

Related Questions