Toon Krijthe

Reputation: 53426

What are the experiences with using Unicode in identifiers?

These days, more languages support Unicode, which is a good thing. But it also presents a danger. In the past there were troubles distinguishing between 1 and l, and between 0 and O. Now we have a whole new range of similar-looking characters.

For example:

ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ

With these, it is not difficult to create some very hard-to-find bugs.
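To make the risk concrete, here is a minimal Python sketch that shows why two such characters are different to the compiler even when they look alike on screen. The helper name `describe_chars` is made up for this illustration; it only relies on the standard `unicodedata` module.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# Two letters that are easy to confuse in many fonts.
latin_i = "i"         # U+0069 LATIN SMALL LETTER I
dotless_i = "\u0131"  # U+0131 LATIN SMALL LETTER DOTLESS I

# They compare as different strings, so they would be different identifiers.
print(latin_i == dotless_i)

for row in describe_chars(latin_i + dotless_i):
    print(row)
```

Printing the code point and name is usually the quickest way to find out which of two look-alike characters actually sits in a source file.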

At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using Unicode identifiers, and what are your experiences?

Upvotes: 8

Views: 684

Answers (6)

unbeknown

Reputation:

I have never used Unicode for identifier names. But what comes to mind is that Python allows Unicode identifiers as of version 3: PEP 3131.
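As a rough illustration of what PEP 3131 means in practice, the sketch below defines non-ASCII identifiers in Python 3. It uses `exec` only so that the identifiers can be written as escape sequences; the variable names are arbitrary examples. Note that Python normalizes identifiers to NFKC while parsing, so some visually distinct spellings end up as the same name.

```python
import unicodedata

namespace = {}

# "caf\u00e9" is "café"; Python 3 accepts it as an identifier.
exec("caf\u00e9 = 1", namespace)
print("caf\u00e9" in namespace)

# str.isidentifier() reports whether a string is a valid identifier.
print("caf\u00e9".isidentifier())

# Identifiers are normalized to NFKC while parsing, so the "fi" ligature
# (U+FB01) followed by "le" is stored under the plain ASCII name "file".
exec("\ufb01le = 2", namespace)
print(unicodedata.normalize("NFKC", "\ufb01le"))
print("file" in namespace)
```

The normalization removes some look-alike problems (ligatures, full-width letters) but not the ones in the question, since characters such as the dotless i are not NFKC-equivalent to the ASCII i.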

Another language that makes extensive use of unicode is Fortress.

Even if you decide not to use Unicode, the problem resurfaces when you use a library that does. So you have to live with it to a certain extent.

Upvotes: 0

Esteban Küber

Reputation: 36852

It depends on the language you're using. In Python, for example, it is easier for me to stick to Unicode, as my applications need to work in several languages. So when I get a file from someone (or something) that I don't know, I assume Latin-1 and translate to Unicode.
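A minimal sketch of that approach in Python 3, assuming the incoming bytes really are Latin-1. The function name `to_unicode` and the sample bytes are invented for this example.

```python
def to_unicode(raw, assumed_encoding="latin-1"):
    """Decode bytes of unknown origin, assuming Latin-1 unless told otherwise."""
    return raw.decode(assumed_encoding)


# Bytes as they might arrive from a Latin-1 encoded file.
raw = b"caf\xe9"

text = to_unicode(raw)       # a Unicode str
utf8 = text.encode("utf-8")  # re-encoded for storage or transport

print(text)
print(utf8)
```

One caveat: Latin-1 maps every byte value to a character, so decoding never raises an error. If the guess is wrong, the result is silently garbled text rather than an exception.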

Works for me, as I'm in Latin America.

Actually, once everything is ironed out, the whole thing becomes a smooth ride.

Of course, this depends on the language of choice.

Upvotes: 0

Windows programmer

Reputation: 8065

I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
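If you want to enforce an ASCII-only rule rather than rely on convention, something like the following Python sketch could be run over source files. The function name `non_ascii_names` is hypothetical; it uses the standard `tokenize` module and `str.isascii()` (Python 3.7+), and it only understands Python source.

```python
import io
import tokenize


def non_ascii_names(source):
    """Return (name, line number) for every identifier with a character above 127."""
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.append((tok.string, tok.start[0]))
    return found


sample = "gr\u00f6\u00dfe = 1\ncount = 2\n"
print(non_ascii_names(sample))
```

A check like this fits naturally into a pre-commit hook or a build step, so that a stray non-ASCII identifier is caught before it reaches other people's editors and code pages.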

In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.

Now on to the difference between ANSI code pages and Unicode.

In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.

Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
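For illustration, here is how that looks in Python (the answer above is not specific to any one language). The source stays pure ASCII, and the non-ASCII character is written as an escape sequence, so the code page of whoever opens the file no longer matters. The sample strings are made up.

```python
# The source file contains only ASCII; \u00fc is LATIN SMALL LETTER U WITH DIAERESIS.
name = "K\u00fcber"

# The string holds the real character at run time.
print(len(name))
print(name[1] == chr(0xFC))

# Going the other way: turn existing text into an ASCII-only escaped form.
escaped = name.encode("unicode_escape").decode("ascii")
print(escaped)
```

Note that Python's `unicode_escape` codec writes code points below 256 as `\xNN` and higher ones as `\uNNNN`; both forms are valid escapes in a Python string literal.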

Upvotes: 3

hayalci

Reputation: 4119

I would also recommend using ASCII for identifiers. Comments can stay in a non-English language if the editor, IDE, compiler, etc. are all locale-aware and set up to use the same encoding.

Additionally, some case-insensitive languages convert identifiers to lowercase before using them, and that causes problems if the active system locale is Turkish or Azerbaijani. See here for more info about the Turkish locale problem. I know that PHP does this, and it has a long-standing bug.

Just to point out, this problem is not limited to language implementations: it is present in any software that compares strings using Turkish locales. It causes many headaches.
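A small Python sketch of why the Turkish rules break such lookups. Python's own `str.lower()` is locale-independent, so the Turkish behaviour is imitated here by a hand-written helper, `turkish_lower`, which is purely illustrative and not a full locale implementation.

```python
def turkish_lower(text):
    """Lowercase using the Turkish rules for I: 'I' -> dotless i, dotted capital I -> 'i'."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


identifier = "TITLE"

plain = identifier.lower()           # locale-independent lowercasing
turkish = turkish_lower(identifier)  # what a Turkish-locale lowercasing produces

print(plain)
print(turkish)

# A case-insensitive lookup that lowercases with Turkish rules no longer
# finds the name that was registered as "title".
print(plain == turkish)
```

The usual way out is to do identifier comparisons with a fixed, locale-independent case mapping and to keep locale-aware case conversion for text shown to users.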

Upvotes: 1

MusiGenesis

Reputation: 75366

My experience with using unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like unicode, and when you find yourself manually fixing corrupted source files in Word you know something isn't right.

I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).

Upvotes: 6

Vinko Vrsalovic

Reputation: 340326

Besides the similar-character bugs you mention, there are technical issues that can arise when using different editors: files with or without a BOM, or different encodings in the same file after copy-pasting (which is only a problem when there are actually characters that cannot be encoded in ASCII), and so on. Given all that, I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development and you should stick to it while writing code.

This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).

Upvotes: 10
