user3402011

Reputation: 69

Remove four byte UTF-8 characters in classic ASP/VBScript (MySQL related)

I've spent about 18 hours trying different things and searching around, and I finally give up and have to ask you guys.

Backstory: I am finally migrating an old MS Access database to MySQL (version 5.6.16-log).

Problem: Some Unicode text in the Access database contains characters that take four bytes in UTF-8.

MySQL still has a problem inserting four-byte UTF-8 characters. The problem is an old one, and I was surprised to discover it hasn't been fixed yet: http://bugs.mysql.com/bug.php?id=67297

I'm using the "MySQL ODBC 5.3 Unicode Driver" (the latest beta development release) to transfer data between the databases. No matter what I try, the process freezes when I insert a string containing 4-byte UTF-8 characters (the thread uses 100% CPU forever). I have tried every workaround I could find on the Internet; nothing works.

Now I will just accept the limitations of MySQL: I can't store all Unicode characters.

So I want to remove all 4-byte UTF-8 characters from the text before I insert it into the database, but I can't for the life of me find a way to do it in classic ASP.

Can anybody help?

(I have to use ASP, by the way; there is far too much code to rewrite in a different language. Just changing databases is a remarkable feat: there are several of them, and it will take days to complete.)

Edit: A solution in JScript is also acceptable, since it can be run from ASP pages.

Upvotes: 2

Views: 4202

Answers (1)

Nathan Rice

Reputation: 3111

This should work:

Function UTF8Filter(strString)
    Dim i, charCode, strResult
    strResult = ""
    For i = 1 To Len(strString)
        charCode = AscW(Mid(strString, i, 1))
        ' Keep only printable ASCII (space through tilde).
        ' Note: this drops every non-ASCII character, not just the 4-byte ones.
        If charCode >= 32 And charCode <= 126 Then
            ' Append valid character
            strResult = strResult & Mid(strString, i, 1)
        End If
    Next

    UTF8Filter = strResult
End Function

Updated function:

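Function Remove4ByteUTF8(strString)
    Dim objRegEx
    Set objRegEx = CreateObject("VBScript.RegExp")
    objRegEx.Global = True
    ' VBScript strings are UTF-16, so a character that needs four bytes in UTF-8
    ' appears as a surrogate pair (high surrogate followed by low surrogate).
    objRegEx.Pattern = "[\uD800-\uDBFF][\uDC00-\uDFFF]"

    Remove4ByteUTF8 = objRegEx.Replace(strString, "")
End Function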
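
Since the question says JScript is acceptable too, here is a rough sketch in JScript-compatible syntax (plain ES3, no newer features). It assumes the text reaches the script as an ordinary UTF-16 string, which is how classic ASP holds strings; in that form, any character that takes four bytes in UTF-8 shows up as a surrogate pair. The function name remove4ByteChars is just one I made up for illustration.

```javascript
// Hypothetical helper: strips every character outside the Basic Multilingual
// Plane (the ones that need four bytes in UTF-8). In a UTF-16 string each such
// character is a high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF).
function remove4ByteChars(str) {
    if (str === null || str === undefined) {
        return "";
    }
    return String(str).replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, "");
}
```

One- to three-byte characters (accented letters, the euro sign, CJK text in the BMP) are left alone. In an ASP page this could probably live in a `<script language="JScript" runat="server">` block and be called from the existing VBScript code, but I have not tested that setup against your pages.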

Upvotes: 2
