Reputation: 71
I am trying to validate a file's content when it is uploaded, and I am stuck on Unicode encoding. I am not interested in finding special Unicode characters outside the ASCII range. I am trying to find out whether the file's content contains at least one Unicode escape pattern, such as \u0046.
For example, I exclude any file that contains the 'script' word, but what if the file contains this word written in Unicode? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?
So, as far as I have searched on the Internet, I've seen Unicode characters written like \u0046, or like U+0046. Based on this, I have written the following regex:
(\\u|U\+)....
This means \u or U+ followed by any four characters. This pattern accomplishes what I want, but I wonder whether there are other ways to write a Unicode character. Is it always \u or U+? Can there be more or fewer than 4 characters after \u or U+?
Thanks
Upvotes: 4
Views: 5701
Reputation: 109547
The notation U+ followed by any number of hex digits is how the Unicode standard itself names code points; it is not functional anywhere in code. In Java source code and *.properties files, \u followed by four hex digits denotes one UTF-16 code unit and is parsed automatically.
The pattern to search for that:
"\\\\u[0-9A-Fa-f]{4}"
Or a String.contains on:
"\\u"
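As a rough sketch of how those two checks could be wired up, assuming the uploaded content has already been read into a String: the class name `UnicodeEscapeDetector` and both method names below are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class UnicodeEscapeDetector {
    // Regex: one literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u[0-9A-Fa-f]{4}");

    private UnicodeEscapeDetector() {}

    // Strict check: true only if a complete backslash-u escape with four hex digits occurs.
    static boolean containsUnicodeEscape(String text) {
        return ESCAPE.matcher(text).find();
    }

    // Broad check: true for any backslash directly followed by the letter u.
    static boolean containsBackslashU(String text) {
        return text.contains("\\u");
    }
}
```

The strict check ignores a backslash-u that is not followed by four hex digits, while the broad check flags it anyway. Note that Java source also accepts repeated u's in an escape (for example \uu0046); changing `u` to `u+` in the pattern would cover that form if it matters for the validation.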
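In languages other than Java, a longer escape exists for the full code point range, for example \U followed by eight hex digits (\Uxxxxxxxx) in C, C++, and Python. Unfortunately, Java (at least up to Java 8) has no such escape, so a character above U+FFFF has to be written as two \uXXXX escapes forming a surrogate pair.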
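To illustrate what that means for Java, here is a minimal sketch that prints the four-digit escapes Java source would need for a given code point. The class `SupplementaryCodePoints` and the method `toJavaEscapes` are hypothetical names; `Character.toChars` is the standard library call that splits a code point into UTF-16 code units.

```java
// Hypothetical demo class; the names are illustrative only.
final class SupplementaryCodePoints {
    private SupplementaryCodePoints() {}

    // Formats a code point as the backslash-u escapes Java source would need.
    // Code points above U+FFFF yield two escapes (a surrogate pair).
    static String toJavaEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }
}
```

For example, `toJavaEscapes(0x46)` gives a single escape, while `toJavaEscapes(0x1F600)` gives two, so a scanner looking for four-digit escapes would see two matches for that one character.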
Upvotes: 4