Reputation: 43099
I am trying to debug why my TCP transfers are corrupted when sent from Cygwin. I see that only the first 24 bytes of each structure are showing up in my server program running on Centos. The 25th through 28th bytes are scrambled and all others after that are zeroed out. Going in the other direction, receiving from Centos on Cygwin, again only the first 24 bytes of each block are showing up in my server program running on Cygwin. The 25th through 40th bytes are scrambled and all others after that are zeroed out. I also see the issue when sending or receiving to/from localhost on Cygwin. For localhost, the first 34 bytes are correct and all after that are zeroed out.
The application I am working on works fine on Centos4 talking to Centos and I am trying to port it to Cygwin. Valgrind reports no issues on Centos; I do not have Valgrind running on Cygwin. Both platforms are little-endian x86.
I've run Wireshark on the host Windows XP system under which Cygwin is running. When I sniff the packets with Wireshark they look perfect, for both sent packets from Cygwin and received packets to Cygwin.
Somehow, the data is corrupted between the level Wireshark looks at and the program itself.
The C++ code uses ::write(fd, buffer, size) and ::read(fd, buffer, size) to write and read the TCP packets, where fd is a file descriptor for the socket that is opened between the client and server. This code works perfectly on Centos4 talking to Centos.
The strangest thing to me is that the packet sniffer shows the correct complete packet for all cases, yet the cygwin application never reads the complete packet or in the other direction, the Centos application never reads the complete packet.
Can anyone suggest how I might go about debugging this?
Here is some requested code:
size_t
read_buf(int fd, char *buf, size_t count, bool &eof, bool immediate)
{
    if (count > SSIZE_MAX) {
        throw;
    }
    size_t want = count;
    size_t got = 0;
    fd_set readFdSet;
    int fdMaxPlus1 = fd + 1;
    FD_ZERO(&readFdSet);
    FD_SET(fd, &readFdSet);
    while (got < want) {
        errno = 0;
        struct timeval timeVal;
        const int timeoutSeconds = 60;
        timeVal.tv_usec = 0;
        timeVal.tv_sec = immediate ? 0 : timeoutSeconds;
        int selectReturn = ::select(fdMaxPlus1, &readFdSet, NULL, NULL, &timeVal);
        if (selectReturn < 0) {
            throw;
        }
        if (selectReturn == 0 || !FD_ISSET(fd, &readFdSet)) {
            throw;
        }
        errno = 0;
        // Read buffer of length count.
        ssize_t result = ::read(fd, buf, want - got);
        if (result < 0) {
            throw;
        } else {
            if (result != 0) {
                // Not an error, increment the byte counter 'got' & the read pointer,
                // buf.
                got += result;
                buf += result;
            } else { // EOF because zero result from read.
                eof = true;
                break;
            }
        }
    }
    return got;
}
I've discovered more about this failure. The C++ class where the packet is being read into is laid out like this:
unsigned char _array[28];
long long _sequence;
unsigned char _type;
unsigned char _num;
short _size;
Apparently, the long long is getting scrambled with the four bytes that follow.
The C++ memory sent by Centos application, starting with _sequence, in hex, looks like this going to write():
_sequence: 45 44 35 44 33 34 43 45
_type: 05
_num: 33
_size: 02 71
Wireshark shows the memory laid out in network big-endian format like this in the packet:
_sequence: 45 43 34 33 44 35 44 45
_type: 05
_num: 33
_size: 71 02
But, after read() in the C++ cygwin little-endian application, it looks like this:
_sequence: 02 71 33 05 45 44 35 44
_type: 00
_num: 00
_size: 00 00
I'm stumped as to how this is occurring. It seems to be an issue with big-endian and little-endian, but the two platforms are both little-endian.
Here _array is 7 ints instead of 28 chars.
Complete memory dump at sender:
_array[0]: 70 a2 b7 cf
_array[1]: 9b 89 41 2c
_array[2]: aa e9 15 76
_array[3]: 9e 09 b6 e2
_array[4]: 85 49 08 81
_array[5]: bd d7 9b 1e
_array[6]: f2 52 df db
_sequence: 41 41 31 35 32 43 38 45
_type: 05
_num: 45
_size: 02 71
And at receipt:
_array[0]: 70 a2 b7 cf
_array[1]: 9b 89 41 2c
_array[2]: aa e9 15 76
_array[3]: 9e 09 b6 e2
_array[4]: 85 49 08 81
_array[5]: bd d7 9b 1e
_array[6]: f2 52 df db
_sequence: 02 71 45 05 41 41 31 35
_type: 0
_num: 0
_size: 0
Cygwin test result:
4
8
48
0x22be08
0x22be28
0x22be31
0x22be32
0x22be38
Centos test result:
4
8
40
0xbfffe010
0xbfffe02c
0xbfffe035
0xbfffe036
0xbfffe038
Upvotes: 2
Views: 601
Reputation: 881243
Hopefully final update :-)
Based on your latest update, Centos is laying out your structure with no padding (_sequence at offset 28, 40 bytes in total) whilst CygWin pads _sequence out to offset 32 (48 bytes in total). This causes alignment problems. I'm not sure why the CygWin-to-CygWin case is having problems since the padding should be identical there, but I can tell you how to fix the other case.
Using the code I gave earlier:
#include <stdio.h>

typedef struct {
    unsigned char _array[28];
    long long _sequence;
    unsigned char _type;
    unsigned char _num;
    short _size;
} tType;

int main (void) {
    tType t[2];
    /* sizeof yields size_t, so print it with %zu rather than %d. */
    printf ("%zu\n", sizeof(long));
    printf ("%zu\n", sizeof(long long));
    printf ("%zu\n", sizeof(tType));
    /* %p expects a void pointer. Note that _type itself is not printed. */
    printf ("%p\n", (void *)&(t[0]._array));
    printf ("%p\n", (void *)&(t[0]._sequence));
    printf ("%p\n", (void *)&(t[0]._num));
    printf ("%p\n", (void *)&(t[0]._size));
    printf ("%p\n", (void *)&(t[1]));
    return 0;
}
If you don't want any padding, you have two choices. The first is to re-organise your structure to put the more restrictive types up front:
typedef struct {
    long long _sequence;
    short _size;
    unsigned char _array[28];
    unsigned char _type;
    unsigned char _num;
} tType;
which gives you:
4
8
40
0x22cd42
0x22cd38
0x22cd5f
0x22cd40
0x22cd60
In other words, each structure is exactly 40 bytes (8 for sequence, 2 for size, 28 for array and 1 each for type and num). But this may not be possible if you want it in a specific order.
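As a rough cross-check of the "exactly 40 bytes" claim, the sketch below compares sizeof against the sum of the member sizes for the reordered layout. The names tReordered, REORDERED_PAYLOAD and reordered_padding are made up for this illustration, and it assumes an 8-byte long long and a 2-byte short, as on the platforms discussed here.

```c
#include <assert.h>   /* assert */
#include <stddef.h>   /* size_t */

/* The reordered layout from the text: most restrictive type first. */
typedef struct {
    long long     _sequence;
    short         _size;
    unsigned char _array[28];
    unsigned char _type;
    unsigned char _num;
} tReordered;

/* Size the struct would have with no padding at all (sum of the members). */
enum {
    REORDERED_PAYLOAD = sizeof(long long) + sizeof(short) + 28 + 1 + 1
};

/* Hypothetical helper: how many padding bytes did the compiler add? */
static size_t reordered_padding(void)
{
    /* A struct can never be smaller than the sum of its members. */
    assert(sizeof(tReordered) >= (size_t)REORDERED_PAYLOAD);
    return sizeof(tReordered) - (size_t)REORDERED_PAYLOAD;
}
```

If reordered_padding() returns anything other than 0 on one of your platforms, the two ends will not agree on the layout and the struct should not be sent as raw bytes.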
If the original field order has to be kept, you can force the alignments to be on a byte level with:
typedef struct {
    unsigned char _array[28];
    long long _sequence;
    unsigned char _type;
    unsigned char _num;
    short _size;
} __attribute__ ((aligned(1),packed)) tType;
The aligned(1) sets it to byte alignment, but that won't affect much since objects don't like having their alignments reduced. To force that, you need to use packed as well.
Doing that gives you:
4
8
40
0x22cd3c
0x22cd58
0x22cd61
0x22cd62
0x22cd64
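Rather than comparing printed addresses by eye, you could have the compiler check the packed layout for you. This is only a sketch: it assumes GCC or Clang (for __attribute__ and _Static_assert), an 8-byte long long and a 2-byte short, and the names tPacked and check_packed_layout are invented for the example.

```c
#include <assert.h>   /* assert */
#include <stddef.h>   /* offsetof */

/* Same field order as the original structure, with padding removed. */
typedef struct {
    unsigned char _array[28];
    long long     _sequence;
    unsigned char _type;
    unsigned char _num;
    short         _size;
} __attribute__ ((packed)) tPacked;

/* Build fails on any platform where the layout is not the expected 40 bytes. */
_Static_assert(offsetof(tPacked, _sequence) == 28, "_sequence must follow _array directly");
_Static_assert(sizeof(tPacked) == 40, "tPacked must be exactly 40 bytes");

/* Hypothetical runtime equivalent, for compilers without _Static_assert. */
static void check_packed_layout(void)
{
    assert(offsetof(tPacked, _array) == 0);
    assert(offsetof(tPacked, _sequence) == 28);
    assert(offsetof(tPacked, _type) == 36);
    assert(offsetof(tPacked, _num) == 37);
    assert(offsetof(tPacked, _size) == 38);
    assert(sizeof(tPacked) == 40);
}
```

Compiling the same header on both Centos and CygWin would then flag a layout mismatch at build time instead of as corrupted data at run time.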
Earlier history for posterity:
Well, since I wget and ftp huge files just fine from CygWin, my psychic debugging skills tell me it's more likely to be a problem with your code rather than the CygWin software.
In other words, regarding the sentence "the packets are corrupted between the level Wireshark looks at and the program itself", I'd be seriously looking towards the upper end of that scale rather than the lower end :-)
Usually, it's the case that you've assumed a read will get the whole packet that was sent rather than bits at a time but, without seeing the code in question, that's a pretty wild guess.
Make sure you're checking the return value from read to see how many bytes are actually being received. Beyond that, post the code responsible for the read so we can give a more in-depth analysis.
Based on your posted code, it looks okay. The only thing I can suggest is that you check that the buffers you're passing in are big enough and, even if they are, make sure you print them immediately after return in case some other piece of code is corrupting the data.
In fact, in re-reading your question more closely, I'm a little confused. You state you have the same problem with your server code on both Linux and CygWin yet say it's working on Centos.
My only advice at this point is to put debugging printf statements in that function you've shown, such as after the select and read calls to output the relevant variables, including got and buf after changing them, and also in every code path so you can see what it's doing. And also dump the entire structure byte-for-byte at the sending end.
This will hopefully show you immediately where the problem lies, especially since you seem to have data showing up in the wrong place.
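For the byte-for-byte dump, something along these lines would do. It is a minimal sketch, and hex_dump is a made-up helper name, not anything from your code or a library. It formats the raw bytes of any object into a caller-supplied string so the same routine can be used at both the sending and receiving end.

```c
#include <assert.h>   /* assert */
#include <stddef.h>   /* size_t */
#include <stdio.h>    /* snprintf */

/*
 * Hypothetical helper: write the bytes of obj as space-separated two-digit
 * hex into out. Returns the number of characters written (excluding the
 * terminating NUL), or -1 if out is too small.
 */
static int hex_dump(const void *obj, size_t len, char *out, size_t out_size)
{
    const unsigned char *p = obj;
    size_t used = 0;

    assert(obj != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        const char *sep = (i + 1 < len) ? " " : "";
        size_t need = 2 + (i + 1 < len ? 1 : 0);   /* "xx" plus optional space */
        if (used + need + 1 > out_size)            /* +1 for the NUL */
            return -1;
        snprintf(out + used, out_size - used, "%02x%s", (unsigned)p[i], sep);
        used += need;
    }
    return (int)used;
}
```

Calling it as hex_dump(&packet, sizeof(packet), text, sizeof(text)) just before the write and just after the read should show exactly which offsets differ.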
And make sure your types are compatible at both ends. By that, I mean if long long is different sizes on the two platforms, your data will be misaligned.
Okay, checking alignments at both ends, compile and run this program on both systems:
#include <stdio.h>

typedef struct {
    unsigned char _array[28];
    long long _sequence;
    unsigned char _type;
    unsigned char _num;
    short _size;
} tType;

int main (void) {
    tType t[2];
    /* sizeof yields size_t, so print it with %zu rather than %d. */
    printf ("%zu\n", sizeof(long));
    printf ("%zu\n", sizeof(long long));
    printf ("%zu\n", sizeof(tType));
    /* %p expects a void pointer. Note that _type itself is not printed. */
    printf ("%p\n", (void *)&(t[0]._array));
    printf ("%p\n", (void *)&(t[0]._sequence));
    printf ("%p\n", (void *)&(t[0]._num));
    printf ("%p\n", (void *)&(t[0]._size));
    printf ("%p\n", (void *)&(t[1]));
    return 0;
}
On my CygWin, I get:
4 long size
8 long long size
48 structure size
0x22cd30 _array start (size = 28, padded to 32)
0x22cd50 _sequence start (size = 8; the unprinted _type follows at 0x22cd58)
0x22cd59 _num start (size = 1)
0x22cd5a _size start (size = 2, padded to 6 for long long alignment).
0x22cd60 next array element.
The one-byte gap after _sequence is not padding: it is _type, which the program does not print. The only padding is the four bytes after _array and the four bytes after _size.
Check the output from Centos to see if it's incompatible. However, your statement that CygWin-to-CygWin doesn't work is incongruous with that possibility since the alignments and sizes would be compatible (unless your sending and receiving code is compiled differently).
Upvotes: 2
Reputation: 283614
Now that you've shown the data, your problem is clear. You're not controlling the alignment of your struct, so the compiler is automatically putting the 8 byte field (the long long) on an 8 byte boundary (offset 32) from the start of the struct, leaving 4 bytes of padding.
Change the alignment to 1 byte and everything should resolve. Here's the snippet you need:
__attribute__ ((aligned (1))) __attribute ((packed))
I also suggest that you use the fixed-size types for structures being blitted across the network, e.g. uint8_t, uint32_t, uint64_t.
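To take the fixed-size idea one step further, here is a sketch that copies each field into a 40-byte buffer explicitly instead of blitting the struct, so compiler padding no longer matters. This is a different technique from the packed attribute above, not what the original code does. Record, WIRE_SIZE, record_encode and record_decode are hypothetical names, and little-endian byte order on the wire is an assumption chosen because both platforms here are little-endian x86.

```c
#include <assert.h>   /* assert */
#include <stdint.h>   /* uint8_t, int16_t, int64_t, ... */
#include <string.h>   /* memcpy */

enum { WIRE_SIZE = 28 + 8 + 1 + 1 + 2 };   /* 40 bytes on the wire */

/* In-memory form; its padding is irrelevant because it is never sent raw. */
typedef struct {
    uint8_t array[28];
    int64_t sequence;
    uint8_t type;
    uint8_t num;
    int16_t size;
} Record;

/* Serialise field by field, multi-byte values least significant byte first. */
static void record_encode(const Record *r, uint8_t out[WIRE_SIZE])
{
    assert(r != NULL && out != NULL);
    memcpy(out, r->array, 28);
    uint64_t seq = (uint64_t)r->sequence;
    for (int i = 0; i < 8; i++)
        out[28 + i] = (uint8_t)(seq >> (8 * i));
    out[36] = r->type;
    out[37] = r->num;
    uint16_t sz = (uint16_t)r->size;
    out[38] = (uint8_t)(sz & 0xff);
    out[39] = (uint8_t)(sz >> 8);
}

/* Inverse of record_encode. */
static void record_decode(const uint8_t in[WIRE_SIZE], Record *r)
{
    assert(in != NULL && r != NULL);
    memcpy(r->array, in, 28);
    uint64_t seq = 0;
    for (int i = 0; i < 8; i++)
        seq |= (uint64_t)in[28 + i] << (8 * i);
    r->sequence = (int64_t)seq;
    r->type = in[36];
    r->num = in[37];
    r->size = (int16_t)(uint16_t)(in[38] | (in[39] << 8));
}
```

The cost is a copy per record; the benefit is that the wire format is defined by the code rather than by whatever alignment rules each compiler happens to use.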
Previous thoughts:
With TCP, you don't read and write packets. You read and write from a stream of bytes. Packets are used to carry these bytes, but boundaries are not preserved.
Your code looks like it deals with this reasonably well; you might want to update the wording of your question.
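For anyone landing here with the more common version of this problem, the byte-stream point boils down to looping until the requested count has arrived. The sketch below is a stripped-down illustration, not the poster's function: read_full is a hypothetical name, it omits the select timeout used in the question, and it assumes a blocking POSIX file descriptor.

```c
#include <assert.h>     /* assert */
#include <errno.h>      /* errno, EINTR */
#include <stddef.h>     /* size_t */
#include <sys/types.h>  /* ssize_t */
#include <unistd.h>     /* read */

/*
 * Hypothetical helper: keep calling read() until count bytes have arrived.
 * Returns the number of bytes read (less than count only at end of stream),
 * or -1 on error with errno set by read().
 */
static ssize_t read_full(int fd, void *buf, size_t count)
{
    unsigned char *p = buf;
    size_t got = 0;

    assert(buf != NULL);
    while (got < count) {
        ssize_t n = read(fd, p + got, count - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, just retry */
            return -1;
        }
        if (n == 0)
            break;                 /* peer closed the stream */
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

A return value smaller than count means the stream ended mid-record, which the caller should treat differently from a clean end of stream at a record boundary.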
Upvotes: 5