Reputation: 847
I need to convert a guid into 19 or fewer characters that can be converted back into the exact same guid.
The closest encoding I have been able to find that has formal documentation and looks like what I need is this base 85 encoding. It uses 85 of the "safe" characters from the 128-character ASCII set and brings any GUID down to 20 characters, which is the best you can get without using the extended ASCII range.
That being said, I need to know if there is a formal encoding, for some extended ASCII set, that is base 107 or more, because 107 is the minimum number of symbols needed to fit any GUID into 19 characters.
x^19 - 1 ≥ 16^32 - 1, so x must be at least 107 (106^19 < 16^32 ≤ 107^19)
Note: I could easily come up with my own conversion but I would like to know if there is a standardized algorithm that will solve the problem.
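The bound above can be checked with exact integer arithmetic. The following is only a quick sketch; the helper names `min_base` and `min_length` are made up for illustration and are not part of any standard.

```python
# A GUID is 128 bits, i.e. 16**32 == 2**128 distinct values.
GUID_VALUES = 16 ** 32


def min_base(length: int) -> int:
    """Smallest base whose `length`-character strings can cover every GUID."""
    base = 2
    while base ** length < GUID_VALUES:
        base += 1
    return base


def min_length(base: int) -> int:
    """Fewest characters needed in `base` to cover every GUID."""
    length = 1
    while base ** length < GUID_VALUES:
        length += 1
    return length


print(min_base(19))    # 107
print(min_length(85))  # 20
print(min_length(95))  # 20, even with all 95 printable ASCII characters
```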
Upvotes: 1
Views: 172
Reputation: 12324
A quick web search hasn't turned up any useful encoding standards. And even if there were any, your additional requirement of the characters being easily distinguishable by humans would probably be hard to meet. There are plenty of characters, even in the standard set, that look similar or may cause confusion, like single and double quotes, the different widths of dashes, or the many different diacritics like ó, ò, ô, õ, ö and ø.
These 140 can probably be distinguished without problems when displayed in a large well-chosen font:
0 1 2 3 4 5 6 7 8 9
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
! " # $ % & ( ) * + , - . / : ; < = > ? @ [ \ ] ^ { | } ~
€ ‡ ‰ • ™ ¢ £ ¤ ¥ § © ¬ ® ¯ ° ± ² ³ ¶ ¹ ¼ ½ ¾ ¿ ÷
Š Œ Ž š œ ž µ Æ Ç Ð Ñ æ ç ñ Ÿ Ã Ê Õ Û ÿ ã ê õ û
If you had to remove the characters which may cause technical problems, e.g. when displayed as part of HTML or entered into web forms, that would be:
" % & < > \
If you wanted to remove characters that are difficult or confusing to describe over the phone, that would be e.g.:
‡ ‰ ¤ ¬ ¯ µ ¶ ÷ Ð Œ Æ æ œ
If you wanted to remove characters that may be difficult to identify or distinguish in some (small) fonts, that would be e.g.:
• ™ ® ³ ¹ ¼ ¾ Ç ç |
Then there are problems you face with ordinary text too, like:
l versus I
O versus 0
So a safe set of the most easily distinguishable characters could be e.g.:
1 2 3 4 5 6 7 8 9 (no zero)
a b c d e f g h i j k m n o p q r s t u v w x y z (no 'l')
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
! # $ ( ) * + , - . / : ; = ? @ [ ] ^ { } ~
€ ¢ £ ¥ § © ° ± ² µ ½ ¿
ã Ã ê Ê ñ Ñ õ Õ š Š û Û ÿ Ÿ ž Ž
There are only 110 characters left in this set, so you can still delete one or two if you think they're unclear in a small font, or too similar to each other, or difficult to describe or remember, but as you see, there isn't actually that much of a choice.
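As a rough sketch of how the 110-character set above could be used, the GUID can be treated as one 128-bit integer and written as a fixed-width base-110 number. This is a home-made conversion, not a standardized encoding; the names `ALPHABET`, `encode_guid` and `decode_guid`, and the order of the characters in the alphabet, are assumptions made for this example.

```python
import uuid

# The 110-character "safe set" listed above. The ordering is arbitrary
# (an assumption of this sketch); encoder and decoder only need to agree.
ALPHABET = (
    "123456789"
    "abcdefghijkmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "!#$()*+,-./:;=?@[]^{}~"
    "€¢£¥§©°±²µ½¿"
    "ãÃêÊñÑõÕšŠûÛÿŸžŽ"
)
BASE = len(ALPHABET)  # 110, and 110**19 > 2**128
WIDTH = 19            # fixed output length
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode_guid(guid: uuid.UUID) -> str:
    """Write the GUID's 128-bit integer as 19 base-110 digits."""
    n = guid.int
    digits = []
    for _ in range(WIDTH):
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))  # most significant digit first


def decode_guid(text: str) -> uuid.UUID:
    """Inverse of encode_guid; rejects wrong lengths and unknown characters."""
    if len(text) != WIDTH:
        raise ValueError(f"expected {WIDTH} characters, got {len(text)}")
    n = 0
    for ch in text:
        if ch not in INDEX:
            raise ValueError(f"character {ch!r} is not in the alphabet")
        n = n * BASE + INDEX[ch]
    return uuid.UUID(int=n)  # raises ValueError if n does not fit in 128 bits


original = uuid.uuid4()
encoded = encode_guid(original)
assert len(encoded) == WIDTH
assert decode_guid(encoded) == original
```

Padding to a fixed width of 19 keeps every encoded GUID the same length, at the cost of leading "1" characters (the digit for zero in this ordering) for small values.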
I have to add that recognizing characters is probably culture-dependent. I would expect a French person to easily see the difference between é, è and ê, while to an English speaker all three may look like "an e with an accent on top". That's also why I didn't select any version of 'i' with a diacritic; if you're not expecting different versions of the 'i', because your language doesn't use them, it's easy to confuse the diacritic with a standard dotted 'i'.
Also note that there are different versions of the "Latin-1" character set: the original ISO 8859-1 from 1987, the ISO 8859-15 revision from 1999 which added e.g. the Euro sign, and Windows-1252 (also known as CP-1252), which HTML5 browsers use when a document declares ISO-8859-1 or "Latin-1", and which I used in the example above.
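The difference between these code pages matters for the safe set above, and it can be checked with Python's built-in codecs. This is a small sketch; `NON_ASCII` and `missing` are names invented for the example.

```python
# The non-ASCII part of the safe set from the lists above.
NON_ASCII = "€¢£¥§©°±²µ½¿ãÃêÊñÑõÕšŠûÛÿŸžŽ"


def missing(text: str, codec: str) -> list[str]:
    """Return the characters of `text` that `codec` cannot represent."""
    absent = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            absent.append(ch)
    return absent


# Python codec names for ISO 8859-1, ISO 8859-15 and Windows-1252.
for codec in ("latin-1", "iso8859_15", "cp1252"):
    print(codec, missing(NON_ASCII, codec))
```

Windows-1252 covers the whole set. ISO 8859-15 lacks only ½, and ISO 8859-1 lacks € as well as Š, š, Ž, ž and Ÿ, so the set as listed only fits in a single byte per character under Windows-1252.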
Upvotes: 3