Florian Feldhaus

Reputation: 5942

How to replace UTF-8 characters with their character names in Ruby?

I would like to create a bidirectional mapping of Unicode characters to the characters [a-z] and [0-9]. I thought about using the Unicode character names, such as Left Curly Bracket for {. Unfortunately, I couldn't find a ready-made list of all Unicode characters with their names that can be accessed from Ruby. Wikipedia has a list of Unicode characters, and the Unicode Consortium publishes a Unicode name list. Before I start writing a parser for the list, I wanted to ask:

Upvotes: 2

Views: 1508

Answers (3)

J-_-L

Reputation: 9177

The uniscribe gem does what you are asking for and it works with data from the current Unicode version. From Ruby, you can use it like this:

require "uniscribe/kernel_method"
uniscribe "𝕸𝖎𝖘𝖈 𝖀𝖓𝖎𝖈𝖔𝖉𝖊 𝕮𝖍𝖆𝖗𝖆𝖈𝖙𝖊𝖗𝖘"

which will result in the following output:

1D578 ├─ 𝕸     ├─ MATHEMATICAL BOLD FRAKTUR CAPITAL M
1D58E ├─ 𝖎     ├─ MATHEMATICAL BOLD FRAKTUR SMALL I
1D598 ├─ 𝖘     ├─ MATHEMATICAL BOLD FRAKTUR SMALL S
1D588 ├─ 𝖈     ├─ MATHEMATICAL BOLD FRAKTUR SMALL C
 0020 ├─ ] [        ├─ SPACE
1D580 ├─ 𝖀     ├─ MATHEMATICAL BOLD FRAKTUR CAPITAL U
1D593 ├─ 𝖓     ├─ MATHEMATICAL BOLD FRAKTUR SMALL N
1D58E ├─ 𝖎     ├─ MATHEMATICAL BOLD FRAKTUR SMALL I
1D588 ├─ 𝖈     ├─ MATHEMATICAL BOLD FRAKTUR SMALL C
1D594 ├─ 𝖔     ├─ MATHEMATICAL BOLD FRAKTUR SMALL O
1D589 ├─ 𝖉     ├─ MATHEMATICAL BOLD FRAKTUR SMALL D
1D58A ├─ 𝖊     ├─ MATHEMATICAL BOLD FRAKTUR SMALL E
 0020 ├─ ] [        ├─ SPACE
1D56E ├─ 𝕮     ├─ MATHEMATICAL BOLD FRAKTUR CAPITAL C
1D58D ├─ 𝖍     ├─ MATHEMATICAL BOLD FRAKTUR SMALL H
1D586 ├─ 𝖆     ├─ MATHEMATICAL BOLD FRAKTUR SMALL A
1D597 ├─ 𝖗     ├─ MATHEMATICAL BOLD FRAKTUR SMALL R
1D586 ├─ 𝖆     ├─ MATHEMATICAL BOLD FRAKTUR SMALL A
1D588 ├─ 𝖈     ├─ MATHEMATICAL BOLD FRAKTUR SMALL C
1D599 ├─ 𝖙     ├─ MATHEMATICAL BOLD FRAKTUR SMALL T
1D58A ├─ 𝖊     ├─ MATHEMATICAL BOLD FRAKTUR SMALL E
1D597 ├─ 𝖗     ├─ MATHEMATICAL BOLD FRAKTUR SMALL R
1D598 ├─ 𝖘     ├─ MATHEMATICAL BOLD FRAKTUR SMALL S

Under the hood, it uses the unicode-name and unicode-sequence_name gems, which can also be used directly.

Upvotes: 2

Florian Feldhaus

Reputation: 5942

Based on the suggestion from ovhaag to use the Unicode Utils gem, I came up with the following solution which is working for me:

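require 'unicode_utils'
string = %Q|Testing «ταБЬℓσ»: 1<2 & 4+1>3, now 20% off!|
# Forward direction: one lowercase Unicode name per character
names = string.chars.map { |c| UnicodeUtils.char_name(c).downcase }
# Reverse direction: NAME_MAP maps code point => name, so invert it
name_to_codepoint = UnicodeUtils::NAME_MAP.invert
codepoints = names.map { |name| name_to_codepoint[name.upcase] }
new_string = codepoints.pack('U*')
string == new_string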
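
A side note on the original goal: character names contain spaces and hyphens, so the name-based round trip above does not stay strictly within [a-z] and [0-9]. If that restriction matters, one possible alternative (a different technique, not name-based and not taken from any of the gems mentioned here) is to write each code point as a fixed-width base-36 number. The following is only a minimal sketch using plain Ruby; the method names `encode_base36` and `decode_base36` and the constant `WIDTH` are made up for illustration.

```ruby
# Sketch only: encode each code point as four base-36 digits so the
# result uses nothing but [a-z0-9]. All names here are hypothetical.
WIDTH = 4 # 36**4 = 1_679_616, which exceeds the largest code point 0x10FFFF

# Turn every character into its zero-padded base-36 code point.
def encode_base36(string)
  string.each_char.map { |c| c.ord.to_s(36).rjust(WIDTH, "0") }.join
end

# Split into fixed-width chunks and rebuild the UTF-8 string.
def decode_base36(encoded)
  encoded.scan(/.{#{WIDTH}}/).map { |chunk| chunk.to_i(36) }.pack("U*")
end

original = "Testing «ταБЬℓσ»: 1<2 & 4+1>3, now 20% off!"
encoded  = encode_base36(original)
decoded  = decode_base36(encoded)

puts encoded.match?(/\A[a-z0-9]+\z/) # only lowercase letters and digits
puts decoded == original             # round trip restores the input
```

The fixed width is what makes decoding unambiguous without a separator character. The trade-off is that the encoded text is not human-readable, which the name-based mapping is.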

Upvotes: 0

ovhaag

Reputation: 1278

You can try the unicode_utils gem:

require "unicode_utils/char_name"
UnicodeUtils.char_name "ᾀ" # => "GREEK SMALL LETTER ALPHA .."

For alternatives, look in The Ruby Toolbox for "unicode ..".

The unicode gem looks promising too:

Unicode::decompose(str)

Upvotes: 2
