Reputation: 56
I maintain a Python library which validates and prepares input for a downstream Java service. As such, the pre-validation within the library needs to be consistent with this downstream service. A pain point here has been calculating string length for certain Unicode strings.
Python counts code points to determine the length of a string, while Java counts UTF-16 code units. For characters in the Basic Multilingual Plane (BMP) the two counts agree, but a character beyond the BMP is encoded as a surrogate pair, i.e. two code units, so the counts can differ. For example, the string "wink 😉" has length 6 in Python and length 7 in Java (2 for the emoji + 5 for the other characters).
To replicate Java's length calculation, therefore, we need to encode as UTF-16 and then divide the byte count by 2:
field_value = "wink 😉"
len(field_value.encode("utf-16-le")) // 2
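As a quick check of that arithmetic, here is a minimal sketch comparing the two counts. The helper names `utf16_length` and `utf16_length_no_encode` are made up for illustration and are not part of any library. The second form skips the encoding step by counting code points above U+FFFF twice, and should agree with the first for any string that encodes cleanly.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, which is what Java's String.length() reports."""
    return len(text.encode("utf-16-le")) // 2


def utf16_length_no_encode(text: str) -> int:
    """Same count without encoding: each code point above U+FFFF needs two units."""
    return len(text) + sum(1 for ch in text if ord(ch) > 0xFFFF)


field_value = "wink 😉"
print(len(field_value))                     # 6 code points
print(utf16_length(field_value))            # 7 UTF-16 code units
print(utf16_length_no_encode(field_value))  # 7 as well
```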
However, truncating an input string to the maximum permitted length, measured in UTF-16 code units, is more challenging. Converting to UTF-16 and then slicing is overzealous, since not all of the characters will be outside the BMP:
LIMIT = 6  # hypothetical maximum length
field_value = "wink 😉"
field_value.encode("utf-16-le")[:LIMIT].decode("utf-16-le", "ignore")
What would be an efficient way in Python to truncate a Unicode string (containing BMP + post-BMP characters) in line with this character weighting?
Upvotes: 2
Views: 435
Reputation: 177600
Here's a function to truncate at a valid code point boundary in the string. It works by checking that a too-long string isn't cut in the middle of a surrogate pair. It's based on this similar answer of mine for truncating UTF-8. Note that this does not handle graphemes. You can use unicodedata.category() if needed to check whether the truncation would cut off modifiers.
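As an aside on that last point, before the main function: a rough sketch of what such a check might look like, assuming the cut position is an index into the `str` (code points, not bytes). The helper `back_up_past_marks` is a hypothetical name. It only looks at the general category "M" (combining marks), so it does not cover everything a full grapheme segmentation would, e.g. emoji joined with zero-width joiners.

```python
import unicodedata


def back_up_past_marks(text: str, cut: int) -> int:
    """Move cut left while text[cut] is a combining mark.

    Cutting right before a combining mark would separate it from its
    base character, so the base character is dropped as well.
    """
    while 0 < cut < len(text) and unicodedata.category(text[cut]).startswith("M"):
        cut -= 1
    return cut


text = "cafe\u0301s"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(repr(text[:back_up_past_marks(text, 4)]))  # 'caf'
```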
s = 'A 😉 short 😉😉 test'

def utf16_trailing_surrogate(b):
    '''The high byte of a UTF-16 trailing surrogate has the bit pattern 110111xx.'''
    return (b & 0b1111_1100) == 0b1101_1100

def utf16_byte_truncate(text, max_bytes):
    '''Encode text as UTF-16LE and truncate it to at most max_bytes bytes.
    If the code unit at the cut position is a trailing surrogate, back up
    two bytes so the surrogate pair is dropped whole.
    '''
    i = max_bytes - max_bytes % 2  # make even
    utf16 = text.encode('utf-16le')
    if len(utf16) <= i:  # does it fit?
        return utf16
    # utf16[i+1] is the high byte of the little-endian code unit at the cut.
    if utf16_trailing_surrogate(utf16[i+1]):
        i -= 2
    return utf16[:i]
# test for various max_bytes:
for m in range(len(s.encode('utf-16le')) + 1):
    b = utf16_byte_truncate(s, m)
    print(f'{m:2} {len(b):2} {b.decode("utf-16le")!r}')
Output:
0 0 ''
1 0 ''
2 2 'A'
3 2 'A'
4 4 'A '
5 4 'A '
6 4 'A '
7 4 'A '
8 8 'A 😉'
9 8 'A 😉'
10 10 'A 😉 '
11 10 'A 😉 '
12 12 'A 😉 s'
13 12 'A 😉 s'
14 14 'A 😉 sh'
15 14 'A 😉 sh'
16 16 'A 😉 sho'
17 16 'A 😉 sho'
18 18 'A 😉 shor'
19 18 'A 😉 shor'
20 20 'A 😉 short'
21 20 'A 😉 short'
22 22 'A 😉 short '
23 22 'A 😉 short '
24 22 'A 😉 short '
25 22 'A 😉 short '
26 26 'A 😉 short 😉'
27 26 'A 😉 short 😉'
28 26 'A 😉 short 😉'
29 26 'A 😉 short 😉'
30 30 'A 😉 short 😉😉'
31 30 'A 😉 short 😉😉'
32 32 'A 😉 short 😉😉 '
33 32 'A 😉 short 😉😉 '
34 34 'A 😉 short 😉😉 t'
35 34 'A 😉 short 😉😉 t'
36 36 'A 😉 short 😉😉 te'
37 36 'A 😉 short 😉😉 te'
38 38 'A 😉 short 😉😉 tes'
39 38 'A 😉 short 😉😉 tes'
40 40 'A 😉 short 😉😉 test'
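Since the question asks for a limit in Java-style length (UTF-16 code units) and a `str` result, here is a rough sketch of a variant along the same lines. The name `truncate_utf16_units` is hypothetical. It walks the code points instead of encoding, and for well-formed strings it should give the same text as decoding `utf16_byte_truncate(text, 2 * max_units)`. The same grapheme caveat applies.

```python
def truncate_utf16_units(text: str, max_units: int) -> str:
    """Return the longest prefix of text that fits in max_units UTF-16 code units.

    A code point above U+FFFF costs two units (a surrogate pair) and is
    dropped whole if only one unit of budget is left.
    """
    used = 0
    for index, ch in enumerate(text):
        cost = 2 if ord(ch) > 0xFFFF else 1
        if used + cost > max_units:
            return text[:index]
        used += cost
    return text


print(repr(truncate_utf16_units("wink 😉", 7)))  # 'wink 😉'
print(repr(truncate_utf16_units("wink 😉", 6)))  # 'wink '
print(repr(truncate_utf16_units("wink 😉", 3)))  # 'win'
```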
Upvotes: 1