Manouil Gioulountas
Manouil Gioulountas

Reputation: 11

Excel VBA special characters in literal strings being changed

I have a macro that inserts a few literal strings into an excel file to be converted to a txt file. These literal strings have some German special characters. The macro works as expected for colleagues, except for one particular Greek colleague (other Greek colleagues have no issue).

When that specific colleague reviews the macro's code in the visual basic editor, the German special characters are changed to Greek characters. If it were just visual, I wouldn't care, but this also changes the content of the txt file that is generated.

Example:

correct string = "Empfänger"

incorrect string = "Empfδnger"

I tried:

It is almost certainly related to regional settings, but I have been unsuccessful in finding what specific options it is. Found some posts that directed me to Language for non-unicode programs settings, but that seemingly wasn't the issue. The Region > Administrative > Language for non-unicode programs is showing Greek for the user, but it also shows as Greek for the user in which there is no issue.

If anyone can help me track down the issue, I would greatly appreciate it.

Upvotes: 1

Views: 2400

Answers (1)

GWD
GWD

Reputation: 3998

As you are already aware, the Visual Basic Editor is a non-Unicode app and uses a system-locale-dependent 8-bit character encoding (ANSI) for the source code. The string literals in the source code are then transcoded from this locale-specific non-Unicode codepage to UTF-16 at some point before or during compilation.

As I can not reproduce the behavior you are describing

...it also shows as Greek for the user in which there is no issue.

I can only offer a workaround in this answer, because as far as I know (and as per my testing) the thing you describe as issue is the expected behavior:

correctString = "Empfänger" 'Expected for users with e.g. German system-locale
incorrectString = "Empfδnger" 'Expected for users with Greek system-locale

If you use string literals in code (for which there are definitely many legitimate use cases) only characters from the ASCII codepoint range will behave consistently across all system locales.

You can therefore work around this problem by using some functions from my library VBA-StringTools.

The functions you need are the following:

'Replaces all occurences of unicode characters outside the codePoint range
'defined by maxNonEncodedCharCode with literals of the following formattings:
'   \uXXXX      for characters inside the basic multilingual plane
'   \uXXXXXXXX  for characters outside the basic multilingual plane
'Where:
'   Xes are the digits of the codepoint in hexadecimal. (X = 0-9 or A-F)
Public Function EscapeUnicode(ByRef str As String, _
                     Optional ByVal maxNonEncodedCharCode As Long = &HFF) _
                              As String
    Dim codepoint As Long
    Dim i As Long
    Dim j As Long:          j = 1
    Dim result() As String: ReDim result(1 To Len(str))

    For i = 1 To Len(str)
        codepoint = AscW(Mid$(str, i, 1)) And &HFFFF&

        If codepoint >= &HD800& Then codepoint = AscU(Mid$(str, i, 2))

        If codepoint > &HFFFF& Then 'Outside BMP
            result(j) = "\u" & "00" & Right$("0" & Hex(codepoint), 6)
            i = i + 1
        ElseIf codepoint > maxNonEncodedCharCode Then 'BMP
            result(j) = "\u" & Right$("00" & Hex(codepoint), 4)
        Else
            result(j) = Mid$(str, i, 1)
        End If
        j = j + 1
    Next i
    EscapeUnicode = Join(result, "")
End Function

#If Mac = 0 Then
'Replaces all occurences of unicode literals
'Accepts the following formattings `escapeFormat`:
' efPython = 1 … \uXXXX \u00XXXXXX   (4 or 8 hex digits, 8 for chars outside BMP)
' efRust   = 2 … \u{XXXX} \U{XXXXXX} (1 to 6 hex digits)
' efUPlus  = 4 … u+XXXX u+XXXXXX     (4 or 6 hex digits)
' efMarkup = 8 … &#ddddddd;          (1 to 7 decimal digits)
'Where:
'   - prefixes \u is case insensitive
'   - Xes are the digits of the codepoint in hexadecimal. (X = 0-9 or A-F/a-f)
'Example:
'   - "abcd au+0062\U0063xy\u{64}", efAll returns "abcd abcxyd"
'Notes:
'   - Avoid u+XXXX syntax if string contains literals without delimiters as it
'     can be misinterpreted if adjacent to text starting with 0-9 or a-f.
'   - This function can be slow for very long input strings with many
'     different literals
Private Function UnescapeUnicode(ByRef str As String, _
                        Optional ByVal allowSingleSurrogates As Boolean = False) _
                                 As String
    Const PATTERN_UNICODE_LITERALS As String = _
        "\\u00[01][0-9a-f]{5}|\\u[0-9a-f]{4}|" & _
        "\\u{[0-9a-f]{1,6}}|u\+[0-9a-f]{4,6}|&#\d{1,7};"
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = PATTERN_UNICODE_LITERALS
        Dim mc As Object: Set mc = .Execute(str)
    End With

    Dim match As Variant
    Dim codepoint As Long
    Dim dupeCheck As Collection
    Set dupeCheck = New Collection
    Dim isDuplicate As Boolean

    For Each match In mc
        Dim mv As String: mv = match.value
        On Error Resume Next
        dupeCheck.Add 1, mv
        isDuplicate = Err.number <> 0
        On Error GoTo 0
        If Not isDuplicate Then
            If Left$(mv, 1) = "&" Then
                codepoint = CLng(Mid$(mv, 3, Len(mv) - 3))
            Else
                If Mid$(mv, 3, 1) = "{" Then
                    codepoint = CLng("&H" & Mid$(mv, 4, Len(mv) - 4))
                Else
                    codepoint = CLng("&H" & Mid$(mv, 3))
                End If
            End If
            If codepoint < &H110000 Then
                If codepoint < &HD800& Or codepoint >= &HE000& _
                Or allowSingleSurrogates Then _
                    str = Replace(str, mv, ChrU(codepoint))
            End If
        End If
    Next match
    UnescapeUnicode = str
End Function
#End If

'Returns the given unicode codepoint as standard VBA UTF-16LE string
 Public Function ChrU(ByVal codepoint As Long, _
             Optional ByVal allowSingleSurrogates As Boolean = False) As String
    Const methodName As String = "ChrU"

    If codepoint < &H8000 Then Err.Raise 5, methodName, "Codepoint < -32768"
    If codepoint < 0 Then codepoint = codepoint And &HFFFF& 'Incase of uInt input

    If codepoint < &HD800& Then
        ChrU = ChrW$(codepoint)
    ElseIf codepoint < &HE000& And Not allowSingleSurrogates Then
        Err.Raise 5, methodName, "Range reserved for surrogate pairs"
    ElseIf codepoint < &H10000 Then
        ChrU = ChrW$(codepoint)
    ElseIf codepoint < &H110000 Then
        codepoint = codepoint - &H10000
        ChrU = ChrW$(&HD800& Or (codepoint \ &H400&)) & _
               ChrW$(&HDC00& Or (codepoint And &H3FF&))
    Else
        Err.Raise 5, methodName, "Codepoint outside of valid Unicode range."
    End If
End Function

'Returns a given characters unicode codepoint as long.
'Note: One unicode character can consist of two VBA "characters", a so-called
'      "surrogate pair" (input string of length 2, so Len(char) = 2!)
Public Function AscU(ByRef char As String) As Long
    AscU = AscW(char) And &HFFFF&
    If Len(char) > 1 Then
        Dim lo As Long: lo = AscW(Mid$(char, 2, 1)) And &HFFFF&
        If &HDC00& > lo Or lo > &HDFFF& Then Exit Function
        AscU = (AscU - &HD800&) * &H400& + (lo - &HDC00&) + &H10000
    End If
End Function

When developing the macro, you use the EscapeUnicode function to replace all Unicode characters outside the ASCII range with escape sequences. For the string in your question, you would put the following line into the immediate window:

?EscapeUnicode("Empfänger", 127) '127 is the last character in the ASCII range

which will return the escaped string Empf\u00E4nger.

In your actual code, you now use this string together with the UnescapeUnicode function:

correctString = UnescapeUnicode("Empf\u00E4nger")

The variable will now contain the correct string when inserted into a cell on the worksheet, regardless of system-locale.

Maybe this workaround is more suitable for you than creating a hidden sheet to read the string literals from.

Note that the function UnescapeUnicode provided in this answer only works on Windows because it uses regular expressions. The function from the GitHub repository supports MacOS and performs much better, so I'd recommend just adding the LibStringTools module to your project.

Upvotes: 2

Related Questions