Reputation: 5942
I would like to create a bidirectional mapping of Unicode characters to the characters [a-z] and [0-9]. I thought about using the Unicode character names, such as Left Curly Bracket for {. Unfortunately, I couldn't find a list of all Unicode characters with their names that is ready to be used from Ruby. Wikipedia has a list of Unicode characters, and the Unicode Consortium publishes a Unicode name list. Before I start writing a parser for that list, I wanted to ask:
Upvotes: 2
Views: 1508
Reputation: 9177
The uniscribe gem does what you are asking for and it works with data from the current Unicode version. From Ruby, you can use it like this:
require "uniscribe/kernel_method"
uniscribe "𝕸𝖎𝖘𝖈 𝖀𝖓𝖎𝖈𝖔𝖉𝖊 𝕮𝖍𝖆𝖗𝖆𝖈𝖙𝖊𝖗𝖘"
which will result in the following output:
1D578 ├─ 𝕸 ├─ MATHEMATICAL BOLD FRAKTUR CAPITAL M
1D58E ├─ 𝖎 ├─ MATHEMATICAL BOLD FRAKTUR SMALL I
1D598 ├─ 𝖘 ├─ MATHEMATICAL BOLD FRAKTUR SMALL S
1D588 ├─ 𝖈 ├─ MATHEMATICAL BOLD FRAKTUR SMALL C
0020 ├─ ] [ ├─ SPACE
1D580 ├─ 𝖀 ├─ MATHEMATICAL BOLD FRAKTUR CAPITAL U
1D593 ├─ 𝖓 ├─ MATHEMATICAL BOLD FRAKTUR SMALL N
1D58E ├─ 𝖎 ├─ MATHEMATICAL BOLD FRAKTUR SMALL I
1D588 ├─ 𝖈 ├─ MATHEMATICAL BOLD FRAKTUR SMALL C
1D594 ├─ 𝖔 ├─ MATHEMATICAL BOLD FRAKTUR SMALL O
1D589 ├─ 𝖉 ├─ MATHEMATICAL BOLD FRAKTUR SMALL D
1D58A ├─ 𝖊 ├─ MATHEMATICAL BOLD FRAKTUR SMALL E
0020 ├─ ] [ ├─ SPACE
1D56E ├─ 𝕮 ├─ MATHEMATICAL BOLD FRAKTUR CAPITAL C
1D58D ├─ 𝖍 ├─ MATHEMATICAL BOLD FRAKTUR SMALL H
1D586 ├─ 𝖆 ├─ MATHEMATICAL BOLD FRAKTUR SMALL A
1D597 ├─ 𝖗 ├─ MATHEMATICAL BOLD FRAKTUR SMALL R
1D586 ├─ 𝖆 ├─ MATHEMATICAL BOLD FRAKTUR SMALL A
1D588 ├─ 𝖈 ├─ MATHEMATICAL BOLD FRAKTUR SMALL C
1D599 ├─ 𝖙 ├─ MATHEMATICAL BOLD FRAKTUR SMALL T
1D58A ├─ 𝖊 ├─ MATHEMATICAL BOLD FRAKTUR SMALL E
1D597 ├─ 𝖗 ├─ MATHEMATICAL BOLD FRAKTUR SMALL R
1D598 ├─ 𝖘 ├─ MATHEMATICAL BOLD FRAKTUR SMALL S
Under the hood, it uses the unicode-name and unicode-sequence_name gems, which can also be used directly.
Upvotes: 2
Reputation: 5942
Based on the suggestion from ovhaag to use the Unicode Utils gem, I came up with the following solution which is working for me:
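require 'unicode_utils'
string = %Q|Testing «ταБЬℓσ»: 1<2 & 4+1>3, now 20% off!|
# Encode: one lowercase character name per character
mapping = string.chars.collect {|c| UnicodeUtils.char_name(c).downcase}
# Invert NAME_MAP so that code points can be looked up by name
name_to_codepoint = UnicodeUtils::NAME_MAP.invert
# Decode: names back to code points, then pack them into a UTF-8 string
codepoints = mapping.collect {|name| name_to_codepoint[name.upcase]}
new_string = codepoints.pack('U*')
# The round trip worked if this comparison is true
string == new_string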
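The round trip above rests on two steps: inverting a code point to name hash, and turning the recovered code points back into a string with `pack('U*')`. Here is a minimal gem-free sketch of the same idea. It assumes a tiny hand-written table (`SAMPLE_NAMES`, hypothetical and covering only three characters) in place of `UnicodeUtils::NAME_MAP`; the helper names `to_names` and `from_names` are also made up for illustration.

```ruby
# Hypothetical stand-in for UnicodeUtils::NAME_MAP: code point => name.
SAMPLE_NAMES = {
  0x007B => "LEFT CURLY BRACKET",
  0x007D => "RIGHT CURLY BRACKET",
  0x0061 => "LATIN SMALL LETTER A",
}.freeze

# Inverted table: name => code point.
SAMPLE_CODEPOINTS = SAMPLE_NAMES.invert.freeze

# Encode a string as a list of character names.
def to_names(string)
  string.codepoints.map { |cp| SAMPLE_NAMES.fetch(cp) }
end

# Decode a list of character names back into a UTF-8 string.
def from_names(names)
  names.map { |name| SAMPLE_CODEPOINTS.fetch(name) }.pack("U*")
end

names = to_names("{a}")
puts names.inspect
puts from_names(names) == "{a}"
```

`fetch` raises a `KeyError` for any character or name missing from the table, so gaps in the mapping show up immediately instead of silently producing `nil`.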
Upvotes: 0
Reputation: 1278
You can try the unicode_utils gem:
require "unicode_utils/char_name"
UnicodeUtils.char_name("ᾀ") # => "GREEK SMALL LETTER ALPHA .."
For alternatives, search The Ruby Toolbox for "unicode ..".
The unicode gem looks promising too:
Unicode::decompose(str)
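To illustrate what decomposition does, here is a small sketch that uses Ruby's built-in `String#unicode_normalize` with the `:nfd` form as a stand-in. It is not the unicode gem's API, and it assumes that the gem's `decompose` performs canonical decomposition in the same sense; `decomposed_codepoints` is a hypothetical helper name.

```ruby
# Canonical decomposition (NFD) using only the Ruby standard library.
def decomposed_codepoints(str)
  str.unicode_normalize(:nfd).codepoints
end

# U+00E9 (e with acute) splits into "e" (U+0065)
# followed by COMBINING ACUTE ACCENT (U+0301).
p decomposed_codepoints("\u00E9").map { |cp| format("U+%04X", cp) }
```

Decomposition separates a base letter from its accents. It does not produce character names, so it complements a name lookup rather than replacing it.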
Upvotes: 2