Truncating a UTF-16 string

I maintain a Python library which validates and prepares input for a downstream Java service. As such, the pre-validation within the library needs to be consistent with this downstream service. A pain point here has been calculating string length for certain Unicode strings.

Python counts code points to determine the length of a string, while Java counts UTF-16 code units. Usually the two results are the same, but they differ for characters beyond the Basic Multilingual Plane (BMP), which UTF-16 encodes as a surrogate pair of two code units. For example, the string "wink 😉" has length 6 in Python and length 7 in Java (2 for the emoji + 5 for the other characters).

To replicate Java's length calculation, therefore, we need to encode as UTF-16 and then divide the byte count by 2:

field_value = "wink 😉"    
len(field_value.encode("utf-16-le")) // 2

However, truncating an input string to the maximum permitted length, measured in UTF-16 code units, is more challenging. Converting to UTF-16 and then slicing the bytes is overzealous, since not ALL the characters will be outside of the BMP:

field_value = "wink 😉"  
field_value.encode("utf-16-le")[:LIMIT].decode("utf-16-le", "ignore")

What would be an efficient way in Python to truncate a Unicode string (containing BMP + post-BMP characters) in line with this character weighting?

Upvotes: 2

Views: 435

Answers (1)

Mark Tolonen

Reputation: 177600

Here's a function to truncate at a valid code point boundary in the string. It works by checking that a too-long string isn't cut in the middle of a surrogate pair. It's based on this similar answer of mine for truncating UTF-8. Note that this does not handle graphemes. If needed, you can use unicodedata.category() to test whether the cut would separate modifiers from their base character.

s = 'A 😉 short 😉😉 test'

def utf16_trailing_surrogate(b):
    '''The high byte of a UTF-16 trailing surrogate starts with the bits 110111xx.'''
    return (b & 0b1111_1100) == 0b1101_1100

def utf16_byte_truncate(text, max_bytes):
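    '''Encode text as UTF-16LE and truncate it to at most max_bytes bytes.
    If the code unit at the cut point is a trailing surrogate, back up two
    bytes so the surrogate pair is not split. Returns the encoded bytes.
    '''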
    i = max_bytes - max_bytes % 2  # make even
    utf16 = text.encode('utf-16le')
    if len(utf16) <= i: # does it fit
        return utf16
    if utf16_trailing_surrogate(utf16[i+1]):
        i -= 2
    return utf16[:i]

# test for various max_bytes:
for m in range(len(s.encode('utf-16le'))+1):
    b = utf16_byte_truncate(s,m)
    print(f'{m:2} {len(b):2} {b.decode("utf-16le")!r}')

Output:

 0  0 ''
 1  0 ''
 2  2 'A'
 3  2 'A'
 4  4 'A '
 5  4 'A '
 6  4 'A '
 7  4 'A '
 8  8 'A 😉'
 9  8 'A 😉'
10 10 'A 😉 '
11 10 'A 😉 '
12 12 'A 😉 s'
13 12 'A 😉 s'
14 14 'A 😉 sh'
15 14 'A 😉 sh'
16 16 'A 😉 sho'
17 16 'A 😉 sho'
18 18 'A 😉 shor'
19 18 'A 😉 shor'
20 20 'A 😉 short'
21 20 'A 😉 short'
22 22 'A 😉 short '
23 22 'A 😉 short '
24 22 'A 😉 short '
25 22 'A 😉 short '
26 26 'A 😉 short 😉'
27 26 'A 😉 short 😉'
28 26 'A 😉 short 😉'
29 26 'A 😉 short 😉'
30 30 'A 😉 short 😉😉'
31 30 'A 😉 short 😉😉'
32 32 'A 😉 short 😉😉 '
33 32 'A 😉 short 😉😉 '
34 34 'A 😉 short 😉😉 t'
35 34 'A 😉 short 😉😉 t'
36 36 'A 😉 short 😉😉 te'
37 36 'A 😉 short 😉😉 te'
38 38 'A 😉 short 😉😉 tes'
39 38 'A 😉 short 😉😉 tes'
40 40 'A 😉 short 😉😉 test'
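
If the limit is expressed in UTF-16 code units (as Java's String.length() counts them) and you want a str back rather than bytes, the same idea can be sketched without encoding at all. This is a different technique from the byte-level function above: it walks the code points and counts 2 units for anything above U+FFFF. The name truncate_utf16_units and the keep_marks_attached flag are hypothetical, chosen for illustration. The flag is only a rough take on the unicodedata.category() note: it covers combining marks (categories Mn, Mc, Me) and is not full grapheme-cluster handling, so emoji modifiers and ZWJ sequences are not covered.

```python
import unicodedata


def truncate_utf16_units(text, max_units, keep_marks_attached=False):
    """Return the longest prefix of text that fits in max_units UTF-16 code units.

    Hypothetical helper: code points above U+FFFF count as 2 units
    (a surrogate pair), everything else counts as 1.
    """
    units = 0
    for index, ch in enumerate(text):
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > max_units:
            cut = index
            if keep_marks_attached:
                # If the first dropped character is a combining mark, back up
                # so its base character is dropped with it. This is an
                # assumption about what "handling modifiers" should mean.
                while cut > 0 and unicodedata.category(text[cut]).startswith("M"):
                    cut -= 1
            return text[:cut]
    return text  # the whole string already fits


# "wink 😉" is 7 code units: a limit of 6 drops the emoji entirely.
# truncate_utf16_units("wink 😉", 6) -> 'wink '
# truncate_utf16_units("wink 😉", 7) -> 'wink 😉'
```

Because it never splits a code point, the result is always a valid str, and its UTF-16 length can be re-checked with len(result.encode("utf-16-le")) // 2.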

Upvotes: 1
