eggbert
eggbert

Reputation: 3233

compressing a string in C# and uncompressing in python

I am trying to compress a large string on a client program in C# (.net 4) and send it to a server (django, python 2.7) using a PUT request. Ideally I want to use the standard library at both ends, so I am trying to use gzip.

My C# code is:

public static string Compress(string s) {
    var bytes = Encoding.Unicode.GetBytes(s);
    using (var msi = new MemoryStream(bytes))
    using (var mso = new MemoryStream()) {
        using (var gs = new GZipStream(mso, CompressionMode.Compress)) {
            msi.CopyTo(gs);
        }
        return Convert.ToBase64String(mso.ToArray());
    }
}

The python code is:

s = base64.standard_b64decode(request)
buff = cStringIO.StringIO(s)

with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()

It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}, i.e. every other letter is something weird. If I take out every other character by doing decompressed_data[::2], then it works, but it's a bit of a hack, and clearly there is something else wrong.

I'm wondering if I need to base64 encode it at all for a PUT request? Is this only necessary for POST?

Upvotes: 4

Views: 2059

Answers (2)

Paulo Bu
Paulo Bu

Reputation: 29794

I think the main problem might be C# uses UTF-16 encoded strings. This may yield a problem similar to yours. As any other encoding problem, we might need a little luck here but I guess you can solve this by doing:

decompressed_data = gz.read().decode('utf-16')

There, decompressed_data should be Unicode and you can treat it as such for further work.

UPDATE: This worked for me:

C Sharp

static void Main(string[] args)
    {
        FileStream f = new FileStream("test", FileMode.CreateNew);
        using (StreamWriter w = new StreamWriter(f))
        {
            w.Write(Compress("hello"));
        }
    }
    public static string Compress(string s)
    {
        var bytes = Encoding.Unicode.GetBytes(s);
        using (var msi = new MemoryStream(bytes))
        using (var mso = new MemoryStream())
        {
            using (var gs = new GZipStream(mso, CompressionMode.Compress))
            {
                msi.CopyTo(gs);
            }
            return Convert.ToBase64String(mso.ToArray());
        }
    }

Python

import base64
import cStringIO
import gzip

f = open('test','rb')
s = base64.standard_b64decode(f.read())
buff = cStringIO.StringIO(s)

with gzip.GzipFile(fileobj=buff) as gz:
    decompressed_data = gz.read()
    print decompressed_data.decode('utf-16')

Without decode('utf-16) it printed in the console:

>>>h e l l o

with it it did well:

>>>hello

Good luck, hope this helps!

Upvotes: 4

Jon Skeet
Jon Skeet

Reputation: 1500375

It's almost working, but the output is: {▯"▯c▯h▯a▯n▯g▯e▯d▯"▯} when it should be {"changed"}

That's because you're using Encoding.Unicode to convert the string to bytes to start with.

If you can tell Python which encoding to use, you could do that - otherwise you need to use an encoding on the C# side which matches what Python expects.

If you can specify it on both sides, I'd suggest using UTF-8 rather than UTF-16. Even though you're compressing, it wouldn't hurt to make the data half the size (in many cases) to start with :)

I'm also somewhat suspicious of this line:

buff = cStringIO.StringIO(s)

s really isn't text data - it's compressed binary data, and should be treated as such. It may be okay - it's just worth checking whether there's a better way.

Upvotes: 2

Related Questions