Reputation: 43
I'm wondering what would be a good, efficient way to delimit a string that can contain basically any character. For instance, I need to concatenate n strings that can look like:
char *str_1 = "foo; for|* 1.234+\"@!`";
char *str_n = "bar; for|* 1.234+%\"@`";
for a final string as:
char *str_final = "foo; for|* 1.234+\"@!`bar; for|* 1.234+%\"@`"; // split?
Which delimiter could I use to properly split it?
Note that there could be more than two strings to concatenate.
I'm open to suggestions.
Thanks
Upvotes: 4
Views: 485
Reputation: 75429
Because my comments kept getting longer and longer, here is a full answer:
Your char * buffer should store the length of the string in the first X bytes (like how Pascal does it). After that length comes the string data, which can contain any characters you like. After that, the next X bytes tell you the length of the next string. So on and so forth, until the end, which is delimited by an empty string (i.e. the last X bytes claim that the next string has zero length, and your application takes this as the signal to stop looking for more strings).
One benefit is that you don't need to scan through the string data - finding the next string from the beginning of the first string takes O(1) time, finding how many strings there are in your list takes O(n) time but will still be blazingly fast (if O(n) is unacceptable you can work around this, but I don't think that's worth getting into right now).
Another benefit is that the string data can contain any character you like. This can be a con - if your string might contain the NUL character, you can safely extract it, but you have to be careful not to pass it to a C string function (like strlen() or strcat()), which will see the NUL character as the end of your data (which it may or may not be). You'll have to rely on memcpy() and pointer arithmetic.
The issue is the value of X (the number of bytes you use to store the string length). The easiest would be 1, which would bypass all endianness and alignment issues, but would limit your strings to 255 characters. If this is a limitation you can live with, excellent, but 255 seems a little low to me.
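The layout described so far can be sketched for the simplest case, X = 1. This is only an illustration under stated assumptions: the names lp_append and lp_get are hypothetical, the caller is assumed to supply a large enough buffer, and because a zero length marks the end of the list, empty strings cannot be stored as records.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append one record (1-byte length + data) at buf[pos].
 * Returns the new write position, or 0 if the record does not fit.
 * (A successful call always returns at least 2, so 0 is unambiguous.) */
static size_t lp_append(unsigned char *buf, size_t cap, size_t pos,
                        const char *s, size_t len)
{
    /* 0 is reserved as the end-of-list marker, so records hold 1..255 bytes. */
    assert(len >= 1 && len <= 255);
    if (pos + 1 + len + 1 > cap)      /* keep room for the final 0 marker */
        return 0;
    buf[pos] = (unsigned char)len;
    memcpy(buf + pos + 1, s, len);    /* data may contain any byte, even NUL */
    buf[pos + 1 + len] = 0;           /* end marker, overwritten by the next append */
    return pos + 1 + len;
}

/* Return a pointer to the idx-th record and store its length in *len,
 * or return NULL if the list holds fewer records. */
static const unsigned char *lp_get(const unsigned char *buf, size_t idx,
                                   size_t *len)
{
    while (*buf != 0) {
        if (idx-- == 0) {
            *len = *buf;
            return buf + 1;
        }
        buf += 1 + *buf;              /* O(1) hop to the next length byte */
    }
    return NULL;
}
```

The returned pointer is not NUL-terminated, so it should be used with memcpy() or a length-aware call such as printf("%.*s", ...) rather than strlen().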
X could be 2 or 4 bytes, but you would need to make sure you have an (unsigned) data type that is at least that many bytes (stdint.h's uint16_t or uint32_t, or maybe uint_least16_t or uint_least32_t). A better solution would be to make X = sizeof(size_t), since the size_t type is guaranteed to be able to store the length of any string you could want to store.
Having X > 1 introduces alignment and, if network portability is an issue, endianness. The simplest way to read the first X bytes as a size_t variable would be to cast your char * data to a size_t * and just dereference. However, unless you can guarantee that your char * data is aligned properly, this will break on some systems. Even if you do guarantee the alignment of your char * data, you'll have to waste a few bytes at the end of most strings to make sure the next string's length value is aligned.
The easiest way to overcome alignment is to manually convert the first sizeof(size_t) bytes to a size_t value. You'll have to decide if you want the data to be stored little- or big-endian. Most computers will be little-endian natively, but for a manual conversion this won't matter - just pick one. The number 65538 (2 ^ 16 + 2) stored in 4 bytes, big-endian, looks like { 0, 1, 0, 2 }; little-endian, { 2, 0, 1, 0 }.
Once you've decided that (it doesn't matter, pick whichever one you like), you just cast the first X bytes of data to unsigned chars, then to size_t, then bit-shift each one by the appropriate amount to put it in the proper place, then add them all together. In the big-endian example above, 0 would be multiplied by 2 ^ 24, 1 by 2 ^ 16, 0 by 2 ^ 8, and 2 by 2 ^ 0 (or 1), producing 0 + 65536 + 0 + 2 or 65538. There probably will be zero efficiency difference between big- and little-endian if you're doing the manual conversion - I want to point out (again) that the choice is entirely arbitrary as far as I can tell.
Doing a manual conversion avoids alignment issues, and completely bypasses concerns about cross-system endianness, so data transferred from a little-endian computer to a big-endian one will be read the same. There is still a potential problem about data being transferred from a system where sizeof(size_t) == 4 to one where sizeof(size_t) == 8. If this is a problem, you can either a) ditch size_t and choose an invariant size, or b) encode (a single byte is all you need) the value of sizeof(size_t) for the sender as the first byte of data, and have the receiver make any necessary adjustments. Choice a) may be easier, but may cause problems (what if you pick a size too low to account for legacy computers on your network, and as they're phased out you start running out of room to store your data?), so I would prefer choice b) since it scales with whatever system you're running (16-bit, 32-bit, 64-bit, maybe even in the future 128-bit), but that kind of effort may not be necessary for you.
</vomit>
I leave it to the reader to sort out all that mess I just wrote.
Upvotes: 3
Reputation: 4378
One solution is to choose an escape character and a delimiter. Typically the backslash \ is used as an escape character, but this may lead to confusion as it's already the escape character for string literals. The choice really doesn't matter, so let's take the forward slash / as escape and the semicolon ; as delimiter. Ideally, choose two characters that are least likely to occur in your strings.
When you concatenate strings, the first step is to search for both characters in the unencoded strings and substitute them by the escaped version:
str1 = "foo;bar;baz";
str2 = "foo/bar/baz";
becomes
estr1 = "foo/;bar/;baz";
estr2 = "foo//bar//baz";
Then they are concatenated with the delimiter:
res = "foo/;bar/;baz;foo//bar//baz";
That's it. Splitting is done by searching for the delimiter without a leading escape character and then substituting escaped characters in the single strings back to the unescaped version.
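A minimal sketch of the escape-and-split steps, using / as the escape and ; as the delimiter as in the example above. The function names (esc_append, esc_join, esc_next) are hypothetical, and the caller is assumed to provide output buffers that are large enough (up to twice the input length plus delimiters).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ESC   '/'
#define DELIM ';'

/* Append s to out in escaped form; returns the new write position. */
static size_t esc_append(char *out, size_t pos, const char *s)
{
    for (; *s != '\0'; s++) {
        if (*s == ESC || *s == DELIM)
            out[pos++] = ESC;        /* prefix special characters */
        out[pos++] = *s;
    }
    return pos;
}

/* Join n strings into out, separated by DELIM and NUL-terminated. */
static void esc_join(char *out, const char *const *strs, size_t n)
{
    size_t pos = 0;
    assert(out != NULL && strs != NULL);
    for (size_t i = 0; i < n; i++) {
        if (i > 0)
            out[pos++] = DELIM;
        pos = esc_append(out, pos, strs[i]);
    }
    out[pos] = '\0';
}

/* Decode the field starting at *cursor into out and advance *cursor past
 * the unescaped delimiter. Returns 0 once no field remains. */
static int esc_next(const char **cursor, char *out)
{
    const char *p = *cursor;
    if (p == NULL)
        return 0;
    while (*p != '\0' && *p != DELIM) {
        if (*p == ESC && p[1] != '\0')
            p++;                     /* drop the escape, keep the next char */
        *out++ = *p++;
    }
    *out = '\0';
    *cursor = (*p == DELIM) ? p + 1 : NULL;
    return 1;
}
```

Decoding while scanning, rather than first searching for a delimiter "without a leading escape character", sidesteps the case where the character before the delimiter is itself an escaped escape (as in "a//;b").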
This is a good choice if you want to work with the strings using functions that expect a single zero-terminated string, e.g. the str functions, or to print them with the printf functions. If you can guarantee that only your own functions will work with these strings, then the delimiting with zeros \0 mentioned in other answers is more efficient, especially since you don't really need to split it: you can use a pointer into the full string to pass a single partial string to the str or printf functions.
Upvotes: 2
Reputation: 215387
If you know your strings will always be valid UTF-8 text (or ASCII), you could use a byte that cannot appear in valid UTF-8 (or ASCII) as a delimiter. In UTF-8, bytes C0, C1, F5, F6, F7, F8, F9, FA, FB, FC, FD, FE, and FF are invalid. In ASCII, any byte with the high bit set is invalid.
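As a rough illustration, here is how joining and splitting on the byte FF could look. The helper names (ff_join, ff_field) are invented for the example, the inputs are assumed to be valid UTF-8 or ASCII, and the output buffer is assumed to be large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UTF8_DELIM 0xFF   /* never appears in well-formed UTF-8 */

/* Join n NUL-terminated strings with 0xFF between them.
 * Returns the length of the result (excluding the final NUL). */
static size_t ff_join(char *out, const char *const *strs, size_t n)
{
    size_t pos = 0;
    assert(out != NULL && strs != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(strs[i]);
        if (i > 0)
            ((unsigned char *)out)[pos++] = UTF8_DELIM;
        memcpy(out + pos, strs[i], len);
        pos += len;
    }
    out[pos] = '\0';
    return pos;
}

/* Return the length of the field starting at s. *next is set to the start
 * of the following field, or to NULL if this was the last one. */
static size_t ff_field(const char *s, const char **next)
{
    size_t total = strlen(s);
    const char *d = memchr(s, UTF8_DELIM, total);  /* compares as unsigned char */
    if (d == NULL) {
        *next = NULL;
        return total;
    }
    *next = d + 1;
    return (size_t)(d - s);
}
```

No escaping is needed here, but the guarantee only holds as long as every input really is valid text; arbitrary binary data could contain FF.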
Upvotes: 2
Reputation: 44131
Perhaps you could encode the length of the string followed by a special character in front of every string? This way you don't have to worry about what characters are in the next N characters. It may be a good idea to null terminate each substring as well.
The one advantage of this approach is that you'll be able to parse through the string quite fast.
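One way to read this suggestion is a decimal length followed by a marker character, then the raw data. The sketch below assumes ':' as the special character and uses hypothetical names (tl_append, tl_next); it also assumes the data contains no NUL bytes, since it relies on strlen().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append "<decimal length>:<data>" at out[pos].
 * Returns the new write position, or 0 if the record does not fit. */
static size_t tl_append(char *out, size_t cap, size_t pos, const char *s)
{
    size_t len = strlen(s);
    int n;
    assert(out != NULL && pos < cap);
    n = snprintf(out + pos, cap - pos, "%zu:", len);
    if (n < 0 || (size_t)n + len >= cap - pos)
        return 0;                        /* no room for header + data + NUL */
    memcpy(out + pos + (size_t)n, s, len);
    out[pos + (size_t)n + len] = '\0';
    return pos + (size_t)n + len;
}

/* Parse the record at *cursor. On success sets *data and *len, advances
 * *cursor to the next record and returns 1; returns 0 at the end of the
 * buffer or on malformed input. */
static int tl_next(const char **cursor, const char **data, size_t *len)
{
    char *end;
    unsigned long n;
    if (!isdigit((unsigned char)**cursor))
        return 0;
    n = strtoul(*cursor, &end, 10);
    if (*end != ':' || strlen(end + 1) < n)
        return 0;
    *data = end + 1;
    *len = (size_t)n;
    *cursor = end + 1 + n;               /* skip the data without scanning it */
    return 1;
}
```

Because the parser skips exactly len bytes after the marker, the data itself may contain digits and ':' without any escaping.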
EDIT: An even better approach is to use the first 2-4 bytes as suggested by Chris in the comment below instead of an encoded length + special character.
Upvotes: 3
Reputation: 355187
One option is to use the null character as a delimiter and double null terminate the list of strings. It would look something like this:
const char* str_final = "foo; for|* 1.234+\"@!`\0bar; for|* 1.234+%\"@`\0";
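(Each \0 in the literal is a delimiter; the last one, together with the literal's implicit terminator, forms the double null that ends the list.)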
Raymond Chen gave a good overview of the double null terminated string in a blog post. It's used by several functions in the Windows API.
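Walking such a list needs nothing beyond strlen(). A small sketch, with hypothetical helper names (dnt_count, dnt_get); note that this format cannot represent an empty string as a list element, because an empty string is the end marker.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the strings in a double-null-terminated list. */
static size_t dnt_count(const char *list)
{
    size_t n = 0;
    assert(list != NULL);
    /* Each step jumps over one string and its terminating NUL. */
    for (const char *p = list; *p != '\0'; p += strlen(p) + 1)
        n++;
    return n;
}

/* Return the idx-th string in the list, or NULL if idx is out of range.
 * The result points into the list and is itself NUL-terminated. */
static const char *dnt_get(const char *list, size_t idx)
{
    assert(list != NULL);
    for (const char *p = list; *p != '\0'; p += strlen(p) + 1)
        if (idx-- == 0)
            return p;
    return NULL;
}
```

Each returned pointer can be handed directly to the str or printf functions, since every element keeps its own terminator.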
Upvotes: 2
Reputation: 14890
2 ideas:
1) Use the standard "escape" approach, something similar to defining a char* literal in C.
2) Use one '\0' character as delimiter, and two of them as end of string marker.
Upvotes: 1