Reputation: 1639
I have a string that can contain arbitrary Unicode characters and I want to get a prefix of that string whose UTF-8 encoded length is as close as possible to 32 bytes, while still being valid UTF-8 and without changing the characters' meaning (i.e. not cutting off an extended grapheme cluster).
Consider this CORRECT example:
let string = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}\u{1F1EA}\u{1F1FA}"
print(string) // 🏴🇪🇺
print(string.count) // 2
print(string.utf8.count) // 36
let prefix = string.utf8Prefix(32) // <-- function I want to implement
print(prefix) // 🏴
print(prefix.count) // 1
print(prefix.utf8.count) // 28
print(string.hasPrefix(prefix)) // true
And this example of a WRONG implementation:
let string = "ar\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}"
print(string) // ar🏴☠️🏴☠️🏴☠️
print(string.count) // 5
print(string.utf8.count) // 41
let prefix = string.wrongUTF8Prefix(32) // <-- wrong implementation
print(prefix) // ar🏴☠️🏴☠️🏴
print(prefix.count) // 5
print(prefix.utf8.count) // 32
print(string.hasPrefix(prefix)) // false
What's an elegant way to do this? (besides trial&error)
Upvotes: 3
Views: 1954
Reputation: 12041
My solution looks like this:
extension String {
func prefix(maxUTF8Length: Int) -> String {
if self.utf8.count <= maxUTF8Length { return self }
var utf8EndIndex = self.utf8.index(self.utf8.startIndex, offsetBy: maxUTF8Length)
while utf8EndIndex > self.utf8.startIndex {
if let stringIndex = utf8EndIndex.samePosition(in: self) {
return String(self[..<stringIndex])
} else {
self.utf8.formIndex(before: &utf8EndIndex)
}
}
return ""
}
}
It takes the highest possible utf8 index, checks if it is a valid character index using the Index.samePosition(in:)
method. If not, it reduces the utf8 index one by one until it finds a valid character index.
The advantage is that you could replace utf8 with utf16 and it would also work.
Upvotes: 1
Reputation: 1453
I like the first solution you came up with. I've found it works more correctly (and simpler) if you take out the formIndex
:
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
let index = self.utf8.index(self.startIndex, offsetBy: maxLength)
return self.prefix(upTo: index)
}
}
Upvotes: 1
Reputation: 1639
I discovered that String
and String.UTF8View
share the same indices, so I managed to create a very simple (and efficient?) solution, I think:
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
var index = self.utf8.index(self.startIndex, offsetBy: maxLength+1)
self.formIndex(before: &index)
return self.prefix(upTo: index)
}
}
maxLength == 32
and startIndex == 0
):The first case (utf8.count <= maxLength
) should be clear, that's where no work is needed.
For the second case we first get the utf8-index 33
, which is either
So if we now move our index back one character (with formIndex(before:)
) this will jump to the first extended grapheme cluster boundary before index
which in case A and B is one character before and in C the start of that character.
I any case, the utf8-index will now be guaranteed to be at most 32
and at an extended grapheme cluster boundary, so prefix(upTo: index)
will safely create a prefix with length ≤32.
…but it's not perfect.
In theory this should also be always the optimal solution, i.e. the prefix's count
is as close as possible to maxLength
but sometimes when the string ends with an extended grapheme cluster consisting of more than one Unicode scalar, formIndex(before: &index)
goes back one character too many than would be necessary, so the prefix ends up shorter. I'm not exactly sure why that's the case.
EDIT: A not as elegant but in exchange completely "correct" solution would be this (still only O(n)):
extension String {
func utf8Prefix(_ maxLength: Int) -> Substring {
if self.utf8.count <= maxLength {
return Substring(self)
}
let endIndex = self.utf8.index(self.startIndex, offsetBy: maxLength)
var index = self.startIndex
while index <= endIndex {
self.formIndex(after: &index)
}
self.formIndex(before: &index)
return self.prefix(upTo: index)
}
}
Upvotes: 4
Reputation: 53010
You've shown no attempt at a solution and SO doesn't normally write code for you. So instead here as some algorithm suggestions for you:
What's an elegant way to do this? (besides trial&error)
By what definition of elegant? (like beauty it depends on the eye of the beholder...)
Simple?
Start with String.makeIterator
, write a while
loop, append Character
s to your prefix as long as the byte count ≤ 32.
It's a very simple loop, worse case is 32 iterations and 32 appends.
"Smart" Search Strategy?
You could implement a strategy based on the average byte length of each Character
in the String
and using String.Prefix(Int)
.
E.g. for your first example the character count is 2 and the byte count 36, giving an average of 18 bytes/character, 18 goes into 32 just once (we don't deal in fractional characters or bytes!) so start with Prefix(1)
, which has a byte count of 28 and leaves 1 character and 8 bytes – so the remainder has an average byte length of 8 and you are seeking at most 4 more bytes, 8 goes into 4 zero times and you are done.
The above example shows the case of extending (or not) your prefix guess. If your prefix guess is too long you can just start your algorithm from scratch using the prefix character & byte counts rather than the original string's.
If you have trouble implementing your algorithm ask a new question showing the code you've written, describe the issue, and someone will undoubtedly help you with the next step.
HTH
Upvotes: 2