Why are my TCP transfers corrupted on cygwin?

Question

I am trying to debug why my TCP transfers are corrupted when sent from Cygwin. I see that only the first 24 bytes of each structure are showing up in my server program running on Centos. The 25th through 28th bytes are scrambled and all others after that are zeroed out. Going in the other direction, receiving from Centos on Cygwin, again only the first 24 bytes of each block are showing up in my server program running on Cygwin. The 25th through 40th bytes are scrambled and all others after that are zeroed out. I also see the issue when sending or receiving to/from localhost on Cygwin. For localhost, the first 34 bytes are correct and all after that are zeroed out.

The application I am working on work fine on Centos4 talking to Centos and I am trying to port it to Cygwin. Valgrind reports no issues on Centos, I do not have Valgrind running on Cygwin. Both platforms are little-endian x86.

I've run Wireshark on the host Windows XP system under which Cygwin is running. When I sniff the packets with Wireshark they look perfect, for both sent packets from Cygwin and received packets to Cygwin.

Somehow, the data is corrupted between the level Wireshark looks at and the program itself.

The C++ code uses ::write(fd, buffer, size) and ::read(fd, buffer, size) to write and read the TCP packets where fd is a file descriptor for the socket that is opened between the client and server. This code works perfectly on Centos4 talking to Centos.

The strangest thing to me is that the packet sniffer shows the correct complete packet for all cases, yet the cygwin application never reads the complete packet or in the other direction, the Centos application never reads the complete packet.

Can anyone suggest how I might go about debugging this?

Here is some requested code:

size_t
read_buf(int fd, char *buf, size_t count, bool &eof, bool immediate)
{
  if (count > SSIZE_MAX) {
    throw;
  }

  size_t want = count;
  size_t got = 0;

  fd_set readFdSet;
  int fdMaxPlus1 = fd + 1;

  FD_ZERO(&readFdSet);
  FD_SET(fd, &readFdSet);

  while (got < want) {
    errno = 0;

    struct timeval timeVal;
    const int timeoutSeconds = 60;

    timeVal.tv_usec = 0;
    timeVal.tv_sec = immediate ? 0 : timeoutSeconds;

    int selectReturn = ::select(fdMaxPlus1, &readFdSet, NULL, NULL, &timeVal);

    if (selectReturn < 0) {
      throw;
    }

    if (selectReturn == 0 || !FD_ISSET(fd, &readFdSet)) {
      throw;
    }

    errno = 0;

    // Read buffer of length count.
    ssize_t result = ::read(fd, buf, want - got);

    if (result < 0) {
      throw;
    } else {
      if (result != 0) {
        // Not an error, increment the byte counter 'got' & the read pointer,
        // buf.
        got += result;
        buf += result;
      } else { // EOF because zero result from read.
        eof = true;
        break;
      }
    }
  }
  return got;
}

I've discovered more about this failure. The C++ class where the packet is being read into is laid out like this:

unsigned char _array[28];
long long _sequence;
unsigned char _type;
unsigned char _num;
short _size;

Apparently, the long long is getting scrambled with the four bytes that follow.

The C++ memory sent by Centos application, starting with _sequence, in hex, looks like this going to write():

_sequence: 45 44 35 44 33 34 43 45
    _type: 05
     _num: 33
    _size: 02 71

Wireshark shows the memory laid out in network big-endian format like this in the packet:

_sequence: 45 43 34 33 44 35 44 45
    _type: 05
     _num: 33
    _size: 71 02

But, after read() in the C++ cygwin little-endian application, it looks like this:

_sequence: 02 71 33 05 45 44 35 44
    _type: 00
     _num: 00
    _size: 00 00

I'm stumped as to how this is occurring. It seems to be an issue with big-endian and little-endian, but the two platforms are both little-endian.

Here _array is 7 ints instead of 28 chars.

Complete memory dump at sender:

_array[0]: 70 a2 b7 cf
_array[1]: 9b 89 41 2c
_array[2]: aa e9 15 76
_array[3]: 9e 09 b6 e2
_array[4]: 85 49 08 81
_array[5]: bd d7 9b 1e
_array[6]: f2 52 df db
_sequence: 41 41 31 35 32 43 38 45
    _type: 05
     _num: 45
    _size: 02 71

And at receipt:

_array[0]: 70 a2 b7 cf
_array[1]: 9b 89 41 2c
_array[2]: aa e9 15 76
_array[3]: 9e 09 b6 e2
_array[4]: 85 49 08 81
_array[5]: bd d7 9b 1e
_array[6]: f2 52 df db
_sequence: 02 71 45 05 41 41 31 35
    _type: 0
     _num: 0
    _size: 0

Cygwin test result:

4
8
48
0x22be08
0x22be28
0x22be31
0x22be32
0x22be38

Centos test result:

4
8
40
0xbfffe010
0xbfffe02c
0xbfffe035
0xbfffe036
0xbfffe038

Ben Voigt · Accepted Answer

Now that you've shown the data, your problem is clear. You're not controlling the alignment of your struct, so the compiler is automatically putting the 8 byte field (the long long) on an 8 byte boundary (offset 32) from the start of the struct, leaving 4 bytes of padding.

Change the alignment to 1 byte and everything should resolve. Here's the snippet you need:

__attribute__ ((aligned (1))) __attribute ((packed))

I also suggest that you use the fixed-size types for structures being blitted across the network, e.g. uint8_t, uint32_t, uint64_t

Previous thoughts:

With TCP, you don't read and write packets. You read and write from a stream of bytes. Packets are used to carry these bytes, but boundaries are not preserved.

Your code looks like it deals with this reasonably well, you might want to update the wording of your question.

Why are my TCP transfers corrupted on cygwin?

Answers (2)

Related Questions