Colonel Panic

Reputation: 137524

How to identify a zero-width character?

Visual Studio 2015 found an unexpected character in my code (error CS1056).

How can I identify what the character is? It's a zero-width character so I can't see it. I'd like to know exactly what it is so I can work out where it comes from and how to fix it with a find-and-replace (I have many similar errors).

Here's an example. There's a zero-width character between x and y in the quote below:

x​y

It would be helpful just to tell me the name of the character in my example, but I'd also like to know generally how to identify characters myself.

Upvotes: 9

Views: 10486

Answers (4)

Jon Skeet

Reputation: 1499770

I have a little bit of JavaScript embedded within my explanation of Unicode which lets you see the Unicode characters you copy/paste into a textbox. Your example looks like this:

[Image: Unicode explorer]

Here you can see that the character is U+200B. Searching for that code point will normally lead you to http://www.fileformat.info, which has a page giving the details of the character.

If you have the characters yourself within an application, Char.GetUnicodeCategory is your friend. (Oddly enough, there's no Char.GetUnicodeCategory(int) for non-BMP characters as far as I can see...)
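As a rough sketch of that idea, the helper below walks a string and reports the code point and Unicode category at each position. The class and method names (`InvisibleCharDump`, `Describe`) are made up for illustration. It uses `CharUnicodeInfo.GetUnicodeCategory(string, int)` together with `char.ConvertToUtf32(string, int)`, which should also cope with surrogate pairs, and it assumes the input is well-formed UTF-16 (an unpaired surrogate would make `ConvertToUtf32` throw).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class InvisibleCharDump
{
    // Returns one line per code point: index, U+XXXX, Unicode category.
    // Hypothetical helper; assumes the string is well-formed UTF-16.
    public static List<string> Describe(string s)
    {
        var lines = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            // Combines a surrogate pair starting at i into a single code point.
            int codePoint = char.ConvertToUtf32(s, i);

            // The (string, int) overload categorizes the character at index i.
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(s, i);

            lines.Add(string.Format("{0}: U+{1:X4} {2}", i, codePoint, category));

            // Skip the low surrogate that belongs to the pair just reported.
            if (char.IsHighSurrogate(s[i])) i++;
        }
        return lines;
    }
}
```

Calling `InvisibleCharDump.Describe("x\u200By")` and printing each line should show the middle character as `1: U+200B Format`, which makes the invisible character easy to spot.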

Upvotes: 11

quetzalcoatl

Reputation: 33506

According to a similar question, "Remove zero-width space characters from a JavaScript string":

I'd press Ctrl+F (or Ctrl+H), turn on the regular expression option, then search (or search-and-replace) for:

[\u200B-\u200D\uFEFF]

I've just tried your example and successfully replaced that zero-width space with an "X".

Please note that this range covers only a few specific characters, as explained in that post, not all invisible characters.

Edit: thanks to this page, I've found a better expression that seems well supported in find/replace when the regular expression option is turned on:

\p{Cf}

which seems to match invisible characters. It successfully hit the one in your example, though I'm not sure it covers everything you'd need. It may be worth trying the whole {C} class, or searching for whitespace or non-printable characters while excluding the {Z} class (or {Zs}).
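If you would rather do the same thing from code than from the editor's find/replace box, here is a minimal sketch using .NET's `Regex` class with the same `\p{Cf}` pattern. The class and method names (`FormatCharCleaner`, `Reveal`, `Strip`) are hypothetical. .NET regular expressions work on UTF-16 code units, so this only catches format characters in the Basic Multilingual Plane, which includes U+200B.

```csharp
using System;
using System.Text.RegularExpressions;

static class FormatCharCleaner
{
    // \p{Cf} matches any UTF-16 code unit in the Unicode "Format" category.
    static readonly Regex FormatChars = new Regex(@"\p{Cf}");

    // Replaces each match with a visible <U+XXXX> marker so it can be located.
    public static string Reveal(string text)
    {
        return FormatChars.Replace(
            text, m => string.Format("<U+{0:X4}>", (int)m.Value[0]));
    }

    // Removes every match from the text.
    public static string Strip(string text)
    {
        return FormatChars.Replace(text, "");
    }
}
```

`Reveal` is probably the safer first step: it shows which characters would be removed before `Strip` deletes them. Some format characters, such as the zero-width joiner, can be intentional in ordinary text, so stripping them blindly may change how that text renders.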

Upvotes: 3

Lasse V. Karlsen

Reputation: 391276

You can ask the built-in Unicode table:

// s holds the example text; s[1] is the invisible character between 'x' and 'y'
var category = char.GetUnicodeCategory(s[1]);

The specific character in your example is in the Format category and here is what MSDN has to say about it:

Format character that affects the layout of text or the operation of text processes, but is not normally rendered. Signified by the Unicode designation "Cf" (other, format). The value is 15.

To get the character code, simply extract it:

char c = s[1];
int codepoint = (int)c; // gives you 0x200B

The Unicode code point U+200B is named ZERO WIDTH SPACE.

Upvotes: 0

Colonel Panic

Reputation: 137524

Aha, use this website: http://www.fileformat.info/info/unicode/char/search.htm?q=%E2%80%8B&preview=entity

Are you looking for Unicode character U+200B: ZERO WIDTH SPACE?

http://www.fileformat.info/info/unicode/char/200b/index.htm

Upvotes: 0
