Software Engineer

Reputation: 3956

C - Decompress Gzipped http response

I have a few issues decompressing a gzipped HTTP response. I separated the data part from the headers, but the gzip header and message contain \0 characters, which char * takes as a null terminator. So the first question is: how do I extract the gzipped chunk?

I can't use string functions like strcat or strlen, because the compressed gzip data contains \0 characters at various places within the chunk.

I've tried libcurl, but it is slower than plain C sockets.

Here is some part of a sample response:

HTTP/1.1 200 OK
Cache-Control: private, max-age=0
Content-Type: text/html; charset=utf-8
P3P: CP="NON UNI COM NAV STA LOC CURa DEVa PSAa PSDa OUR IND"
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 12605
Date: Mon, 05 Mar 2012 11:46:30 GMT
Connection: keep-alive
Set-Cookie: _FP=EM=1; expires=Wed, 05-Mar-2014 11:46:29 GMT; domain=.bing.com; path=/

[gzip-compressed binary body follows]

Sample code:

#define MAXDATASIZE 1024

char *recvData = NULL;      // Holds the entire gzip body
char recvBuff[MAXDATASIZE]; // Holds one received chunk
int offset = 0;
while (1) {
    // Read at most MAXDATASIZE-1 bytes so the chunk can be NUL-terminated
    recvBytes = recv(sockfd, recvBuff, MAXDATASIZE - 1, 0);
    if (recvBytes <= 0)
        break; // error, or the peer closed the connection
    recvBuff[recvBytes] = '\0'; // only needed for the text searches below
    totalRecvBytes += recvBytes;

    // get content length, this runs first time only as required
    if (!clfnd) {
        regi = regexec(&clregex, recvBuff, 3, clmatch, 0);
        if (!regi) {
            int cllen = clmatch[2].rm_eo - clmatch[2].rm_so;
            memcpy(clarr, recvBuff + clmatch[2].rm_so, cllen);
            clarr[cllen] = '\0';
            cl = atoi(clarr);
            clfnd = 1;
            regfree(&clregex);
            recvData = malloc(cl);
            memset(recvData, 0, cl); // sizeof recvData is only the pointer size
        }
    }

    // get data part from 1st iteration, further iterations contain only data
    // (assumes the whole header block arrives in the first recv)
    if (!datasplit) {
        // strstr is safe here: the headers are text and end before any body byte
        char *datastrt = strstr(recvBuff, "\r\n\r\n");
        if (datastrt != NULL && recvData != NULL) {
            int strtidx = datastrt - recvBuff + 4;
            memcpy(recvData, recvBuff + strtidx, recvBytes - strtidx);
            datasplit = 1;
            offset = recvBytes - strtidx;
        }
    }
    else {
        // memcpy with an explicit count copies embedded \0 bytes unchanged
        memcpy(recvData + offset, recvBuff, recvBytes);
        offset += recvBytes;
    }
    if (clfnd && offset >= cl)
        break;
}

// offset*4 is only a guess at the inflated size; +1 keeps a trailing \0 for printing
char *outData = malloc(offset * 4 + 1);
memset(outData, 0, offset * 4 + 1);
int ret = inf(recvData, offset, outData, offset * 4);

Inflate function:

int inf(const char *src, int srcLen, char *dst, int dstLen){
    z_stream strm;
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;

    strm.avail_in = srcLen;
    strm.avail_out = dstLen;
    strm.next_in = (Bytef *)src;
    strm.next_out = (Bytef *)dst; // dst must be writable, so it is not const

    // MAX_WBITS+16 makes zlib expect a gzip header and trailer
    int err = inflateInit2(&strm, MAX_WBITS + 16);
    if (err != Z_OK)
        return err; // initialization failed, nothing to clean up

    err = inflate(&strm, Z_FINISH);
    int ret = (int)strm.total_out;
    inflateEnd(&strm);
    if (err != Z_STREAM_END)
        return err < 0 ? err : Z_BUF_ERROR; // dst too small or input incomplete

    // Print by length: the output is not guaranteed to be NUL-terminated
    printf("%.*s\n", ret, dst);
    return ret; // number of decompressed bytes
}

Upvotes: 2

Views: 5458

Answers (3)

Bruno Soares

Reputation: 796

The HTTP payload starts right after the "\r\n\r\n" sequence that ends the HTTP header.

Use the "Content-Length" header field to get the HTTP payload size.
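As a rough sketch of those two steps, the helpers below locate the first payload byte and read the Content-Length value using explicit lengths rather than str* functions, so \0 bytes in the body do no harm. The names find_body_offset and parse_content_length are made up for this illustration, and the code assumes the whole header block is already in the buffer and that POSIX strncasecmp is available.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strncasecmp (POSIX) */

/* Returns the offset of the first body byte, or -1 if the blank line
 * that ends the headers has not been received yet. */
long find_body_offset(const char *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return (long)(i + 4);
    }
    return -1;
}

/* Scans the header block (the first header_len bytes of buf) line by line
 * for "Content-Length:" and returns its value, or -1 if it is absent. */
long parse_content_length(const char *buf, size_t header_len)
{
    static const char name[] = "Content-Length:";
    const size_t name_len = sizeof name - 1;
    size_t i = 0;

    while (i < header_len) {
        size_t eol = i;
        while (eol < header_len && buf[eol] != '\n')
            eol++;
        /* Header names are case-insensitive, hence strncasecmp. */
        if (eol - i > name_len && strncasecmp(buf + i, name, name_len) == 0) {
            size_t p = i + name_len;
            long value = 0;
            int have_digit = 0;
            while (p < eol && (buf[p] == ' ' || buf[p] == '\t'))
                p++;
            while (p < eol && buf[p] >= '0' && buf[p] <= '9') {
                value = value * 10 + (buf[p] - '0');
                have_digit = 1;
                p++;
            }
            return have_digit ? value : -1;
        }
        i = eol + 1;
    }
    return -1;
}
```

Calling find_body_offset on everything received so far, and retrying after the next recv while it returns -1, also covers the case where the headers arrive split across several reads.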

With this information, you have to write a function to decompress the data. You can do that with zlib.

PS: pay attention to whether the data uses the raw format or zlib with header and trailer. Usually HTTP uses a header and trailer, while IMAP4 uses the raw format.

Upvotes: 0

snibu

Reputation: 581

Content-Length: 12605 means that the gzipped body is 12605 bytes long. So just copy the 12605 bytes that follow the message header into a local buffer and pass that buffer to the decompression function. Also, I am not sure your socket read returns all 12605 bytes in one go. If it doesn't, append the data from each subsequent read to the local buffer, and call the decompression function once 12605 bytes have been read. There is no problem in using char * as the buffer. The issue you're facing is that you're trying to print the gzip data as a string.
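A minimal sketch of that accumulation, assuming the Content-Length value is already known. The struct body_buf type and the body_init, body_append and body_free helpers are hypothetical names for this example. Every copy uses an explicit byte count, so \0 bytes inside the gzip data are handled like any other byte.

```c
#include <stdlib.h>
#include <string.h>

/* Collects exactly `cap` body bytes across several reads. */
struct body_buf {
    unsigned char *data;
    size_t len; /* bytes stored so far */
    size_t cap; /* expected total, i.e. the Content-Length value */
};

/* Returns 0 on success, -1 if the allocation fails. */
int body_init(struct body_buf *b, size_t content_length)
{
    b->data = malloc(content_length ? content_length : 1);
    b->len = 0;
    b->cap = content_length;
    return b->data ? 0 : -1;
}

/* Appends up to n bytes of chunk, never writing past cap.
 * Returns 1 once the whole body has been collected, 0 otherwise. */
int body_append(struct body_buf *b, const void *chunk, size_t n)
{
    size_t room = b->cap - b->len;
    if (n > room)
        n = room; /* ignore anything beyond Content-Length */
    memcpy(b->data + b->len, chunk, n);
    b->len += n;
    return b->len == b->cap;
}

void body_free(struct body_buf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}
```

In a receive loop, the first call would pass the bytes after the header terminator and later calls would pass each full recv chunk; when body_append returns 1, data and len can go straight to the decompression function.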

Upvotes: 2

harald

Reputation: 6126

No, the type char * says nothing about the contents it points to, nor does it interpret any value as a terminator. The str* functions, on the other hand, make an assumption about how strings are represented, so they cannot be used on binary data, or even on text data that has a different representation.

Decompression can be rather complex, but you can have a look at zlib, which should be able to help you out.
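A small illustration of the difference, using a typical ten-byte gzip header as sample data (the array contents and helper names are chosen for this example only). The buffer itself holds all ten bytes; only strlen stops early, at the first zero byte.

```c
#include <stddef.h>
#include <string.h>

/* Typical start of a gzip member: magic bytes 1f 8b, method 08,
 * then flag and mtime bytes that are often zero. */
static const unsigned char gzip_header[10] = {
    0x1f, 0x8b, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x03
};

/* What a str* function sees: everything up to the first zero byte. */
size_t length_seen_by_strlen(const unsigned char *buf)
{
    return strlen((const char *)buf);
}

/* Binary-safe copy: the caller supplies the byte count explicitly. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    memcpy(dst, src, n);
}
```

Here length_seen_by_strlen(gzip_header) yields 3 although the array holds 10 bytes, while copy_bytes with n = sizeof gzip_header copies all of them.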

Upvotes: 4
