YourMJK
YourMJK

Reputation: 1639

Swift 5: String prefix with a maximum UTF-8 length

I have a string that can contain arbitrary Unicode characters and I want to get a prefix of that string whose UTF-8 encoded length is as close as possible to 32 bytes, while still being valid UTF-8 and without changing the characters' meaning (i.e. not cutting off an extended grapheme cluster).

Consider this CORRECT example:

let string = "\u{1F3F4}\u{E0067}\u{E0062}\u{E0073}\u{E0063}\u{E0074}\u{E007F}\u{1F1EA}\u{1F1FA}"
print(string)                    // 🏴󠁧󠁢󠁳󠁣󠁴󠁿🇪🇺
print(string.count)              // 2
print(string.utf8.count)         // 36

let prefix = string.utf8Prefix(32)  // <-- function I want to implement 
print(prefix)                    // 🏴󠁧󠁢󠁳󠁣󠁴󠁿
print(prefix.count)              // 1
print(prefix.utf8.count)         // 28

print(string.hasPrefix(prefix))  // true

And this example of a WRONG implementation:

let string = "ar\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}\u{1F3F4}\u{200D}\u{2620}\u{FE0F}"
print(string)                    // ar🏴‍☠️🏴‍☠️🏴‍☠️
print(string.count)              // 5
print(string.utf8.count)         // 41

let prefix = string.wrongUTF8Prefix(32)  // <-- wrong implementation 
print(prefix)                    // ar🏴‍☠️🏴‍☠️🏴
print(prefix.count)              // 5
print(prefix.utf8.count)         // 32

print(string.hasPrefix(prefix))  // false

What's an elegant way to do this? (besides trial&error)

Upvotes: 3

Views: 1954

Answers (4)

Jakob Egger
Jakob Egger

Reputation: 12041

My solution looks like this:

extension String {
    func prefix(maxUTF8Length: Int) -> String {
        if self.utf8.count <= maxUTF8Length { return self }
        var utf8EndIndex = self.utf8.index(self.utf8.startIndex, offsetBy: maxUTF8Length)
        while utf8EndIndex > self.utf8.startIndex {
            if let stringIndex = utf8EndIndex.samePosition(in: self) {
                return String(self[..<stringIndex])
            } else {
                self.utf8.formIndex(before: &utf8EndIndex)
            }
        }
        return ""
    }
}

It takes the highest possible utf8 index, checks if it is a valid character index using the Index.samePosition(in:) method. If not, it reduces the utf8 index one by one until it finds a valid character index.

The advantage is that you could replace utf8 with utf16 and it would also work.

Upvotes: 1

Gannet
Gannet

Reputation: 1453

I like the first solution you came up with. I've found it works more correctly (and simpler) if you take out the formIndex:

extension String {
    func utf8Prefix(_ maxLength: Int) -> Substring {
        if self.utf8.count <= maxLength {
            return Substring(self)
        }

        let index = self.utf8.index(self.startIndex, offsetBy: maxLength)
        return self.prefix(upTo: index)
    }
}

Upvotes: 1

YourMJK
YourMJK

Reputation: 1639

I discovered that String and String.UTF8View share the same indices, so I managed to create a very simple (and efficient?) solution, I think:

extension String {
    func utf8Prefix(_ maxLength: Int) -> Substring {
        if self.utf8.count <= maxLength {
            return Substring(self)
        }

        var index = self.utf8.index(self.startIndex, offsetBy: maxLength+1)
        self.formIndex(before: &index)
        return self.prefix(upTo: index)
    }
}

Explanation (assuming maxLength == 32 and startIndex == 0):

The first case (utf8.count <= maxLength) should be clear, that's where no work is needed.
For the second case we first get the utf8-index 33, which is either

  • A: the endIndex of the string (if it's exactly 33 bytes long),
  • B: an index at the start of a character (after 33 bytes of previous characters)
  • C: an index somewhere in the middle of a character (after <33 bytes of previous characters)

So if we now move our index back one character (with formIndex(before:)) this will jump to the first extended grapheme cluster boundary before index which in case A and B is one character before and in C the start of that character.
I any case, the utf8-index will now be guaranteed to be at most 32 and at an extended grapheme cluster boundary, so prefix(upTo: index) will safely create a prefix with length ≤32.


…but it's not perfect.
In theory this should also be always the optimal solution, i.e. the prefix's count is as close as possible to maxLength but sometimes when the string ends with an extended grapheme cluster consisting of more than one Unicode scalar, formIndex(before: &index) goes back one character too many than would be necessary, so the prefix ends up shorter. I'm not exactly sure why that's the case.

EDIT: A not as elegant but in exchange completely "correct" solution would be this (still only O(n)):

extension String {
    func utf8Prefix(_ maxLength: Int) -> Substring {
        if self.utf8.count <= maxLength {
            return Substring(self)
        }

        let endIndex = self.utf8.index(self.startIndex, offsetBy: maxLength)
        var index = self.startIndex
        while index <= endIndex {
            self.formIndex(after: &index)
        }
        self.formIndex(before: &index)
        return self.prefix(upTo: index)
    }
}

Upvotes: 4

CRD
CRD

Reputation: 53010

You've shown no attempt at a solution and SO doesn't normally write code for you. So instead here as some algorithm suggestions for you:

What's an elegant way to do this? (besides trial&error)

By what definition of elegant? (like beauty it depends on the eye of the beholder...)

Simple?

Start with String.makeIterator, write a while loop, append Characters to your prefix as long as the byte count ≤ 32.

It's a very simple loop, worse case is 32 iterations and 32 appends.

"Smart" Search Strategy?

You could implement a strategy based on the average byte length of each Character in the String and using String.Prefix(Int).

E.g. for your first example the character count is 2 and the byte count 36, giving an average of 18 bytes/character, 18 goes into 32 just once (we don't deal in fractional characters or bytes!) so start with Prefix(1), which has a byte count of 28 and leaves 1 character and 8 bytes – so the remainder has an average byte length of 8 and you are seeking at most 4 more bytes, 8 goes into 4 zero times and you are done.

The above example shows the case of extending (or not) your prefix guess. If your prefix guess is too long you can just start your algorithm from scratch using the prefix character & byte counts rather than the original string's.

If you have trouble implementing your algorithm ask a new question showing the code you've written, describe the issue, and someone will undoubtedly help you with the next step.

HTH

Upvotes: 2

Related Questions