Truncating a UTF-16 string

Question

I maintain a Python library which validates and prepares input for a downstream Java service. As such, the pre-validation within the library needs to be consistent with this downstream service. A pain point here has been calculating string length for certain Unicode strings.

Python counts code points to determine the length of a string, while Java counts UTF-16 code units (a character outside the Basic Multilingual Plane is stored as a surrogate pair, i.e. two code units). Usually the two counts agree, but for characters beyond the BMP they differ. For example, the string "wink 😉" has length 6 in Python and length 7 in Java (2 for the emoji + 5 for the other characters).

To replicate Java's length calculation, therefore, we need to encode as UTF-16 and then divide the byte count by 2:

field_value = "wink 😉"    
len(field_value.encode("utf-16-le")) // 2
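For reference, the same count can be obtained without encoding by weighting each code point directly: anything above U+FFFF needs a surrogate pair and so counts as 2. This is only a sketch, and the helper name utf16_len is made up for illustration:

```python
def utf16_len(text: str) -> int:
    """Count UTF-16 code units, which is what Java's String.length() returns."""
    # Code points above U+FFFF are stored as a surrogate pair (2 code units);
    # everything in the BMP takes a single code unit.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


field_value = "wink 😉"
print(len(field_value))        # Python length, in code points
print(utf16_len(field_value))  # length in UTF-16 code units
```

For the example string this agrees with the encode-and-halve calculation above.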

However, truncating an input string to the maximum permitted length, measured in UTF-16 code units, is more challenging. Encoding to UTF-16 and then slicing the bytes at LIMIT (the maximum length) is overzealous, since not ALL of the characters will be outside the BMP:

field_value = "wink 😉"  
field_value.encode("utf-16-le")[:LIMIT].decode("utf-16-le", "ignore")

What would be an efficient way in Python to truncate a Unicode string (containing both BMP and non-BMP characters) in line with this character weighting?
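To make the requirement concrete, here is a rough sketch of the behaviour I am after, assuming the limit is expressed in UTF-16 code units and that a character which would straddle the limit should be dropped entirely. The function name truncate_utf16 is hypothetical, and this is a plain linear scan rather than anything optimised:

```python
def truncate_utf16(text: str, limit: int) -> str:
    """Return the longest prefix of text that fits in `limit` UTF-16 code units."""
    units = 0
    for index, ch in enumerate(text):
        # Non-BMP code points cost 2 code units (a surrogate pair), others cost 1.
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > limit:
            # This character would exceed the limit, so cut just before it.
            return text[:index]
    return text


field_value = "wink 😉"
print(truncate_utf16(field_value, 6))  # the emoji does not fit in the last unit
print(truncate_utf16(field_value, 7))  # the whole string fits
```

With a limit of 6 the emoji is removed rather than split in half, since only one code unit remains for it; with a limit of 7 the string is returned unchanged.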
