Reputation: 14868
I've been doing some preliminary work implementing UTF8String, for which I had to address the problems related to messages such as #size, #at:, #do:, etc. Among these there are some for which I could not find a good solution. Examples include #new: (class side) and #at:put: (instance side), because the number of bytes they would need (or use) depends on the actual characters the string will eventually contain.
One idea would be to allocate additional (unused) null bytes in the tail, which would not actually be part of the string, and to use #become: only in those cases where one runs out of null positions. Is this a good (or bad) idea? How should a proper implementation work?
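To make the idea concrete, here is a minimal sketch of the tail-padding approach, assuming UTF8String is a variableByteSubclass; the selectors #usedBytes and #grow are hypothetical names I made up for illustration, not an existing API:

    UTF8String >> usedBytes
        "Effective byte length: everything before the trailing null padding.
         Assumes the encoded text itself never contains a null byte."
        self basicSize to: 1 by: -1 do: [:i |
            (self basicAt: i) = 0 ifFalse: [^ i]].
        ^ 0

    UTF8String >> grow
        "Out of null slack: move every reference to a larger copy via #become:."
        | bigger |
        bigger := self class basicNew: (self basicSize * 2 max: 8).
        1 to: self basicSize do: [:i |
            bigger basicAt: i put: (self basicAt: i)].
        self become: bigger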
Upvotes: 3
Views: 399
Reputation: 15917
If you can afford to spend the effort, you can do much better than going 32-bit for all characters. Actual texts are either all-ASCII (English text, program source), contain a few non-ASCII characters (German, French), or are almost entirely multi-byte. For those with only a few non-ASCII characters, you can keep a supporting data structure to help with #at: and friends, as sketched below.
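As an illustration (not a full implementation), such a structure could record each multi-byte character as an association charIndex -> cumulative extra bytes, so that #at: can translate a character index into a byte index with a binary search. This assumes a pointer-based representation where marks is an instance variable holding that sorted Array; the #byteIndexAt: selector is likewise an invented name:

    UTF8String >> byteIndexAt: charIndex
        "Translate a character index to a byte index in O(log k),
         where k is the number of multi-byte characters.
         marks is a sorted Array of charIndex -> cumulativeExtraBytes."
        | lo hi mid extra |
        extra := 0.
        lo := 1.
        hi := marks size.
        [lo <= hi] whileTrue: [
            mid := (lo + hi) // 2.
            (marks at: mid) key < charIndex
                ifTrue: [extra := (marks at: mid) value. lo := mid + 1]
                ifFalse: [hi := mid - 1]].
        ^ charIndex + extra

For an all-ASCII string marks is empty and the byte index equals the character index, so the common case stays cheap.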
Upvotes: 0
Reputation: 9572
One solution would be to hold the sequence of bytes in an instance variable (a ByteArray) and thus use a normal pointer-based subclass instead of a variableByteSubclass.
The strategy of pre-allocating extra bytes can then be implemented easily, since you would store the effective size in another instance variable. It is up to you to tune the balance between code complexity, memory use, and speed.
The advantage is that you avoid messing with other VM primitives like copyReplaceFrom:to:with:startingAt:, which can transfer raw bytes from one byte-oriented class to another, potentially creating erroneous interpretations of the encoding.
Another advantage is that you don't need to invoke the #become: super-power.
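A rough sketch of this layout, under the assumption of a plain Object subclass with two instance variables (bytes and byteSize, as well as #ensureRoom:, are illustrative names):

    Object subclass: #UTF8String
        instanceVariableNames: 'bytes byteSize'
        classVariableNames: ''
        package: 'UTF8String-Sketch'

    UTF8String >> ensureRoom: extraBytes
        "Grow the backing ByteArray when the slack runs out; no #become:
         is needed because only this object holds onto the ByteArray."
        byteSize + extraBytes > bytes size ifTrue: [
            bytes := (ByteArray new: (byteSize + extraBytes) * 2)
                replaceFrom: 1 to: byteSize with: bytes startingAt: 1]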
Upvotes: 2
Reputation: 4623
IMHO it's best to use UTF-8 only for import and export. Internally, use 32 bits per character.
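For example, in Pharo the conversion at the boundary can be done with Zinc's ZnUTF8Encoder (a sketch; any equivalent codec would do), while the decoded string holds full characters internally:

    | bytes text |
    bytes := ZnUTF8Encoder new encodeString: 'héllo'.  "export: String -> UTF-8 bytes"
    text := ZnUTF8Encoder new decodeBytes: bytes.      "import: UTF-8 bytes -> String"
    text at: 2                                         "O(1) access on the decoded string"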
Upvotes: 2