Basj

Reputation: 46503

Official repository of Unicode character names

There are a few ways to get the list of all Unicode character names: for example, using the Python module unicodedata, as explained in "List of unicode character names", or using the website https://unicode.org/charts/charindex.html. The index on that website is incomplete, though, and you have to open and parse PDFs to find the names.

But what is the official source / repository of all Unicode character names? I'm looking for the original source of these names in a machine-readable format, one that is updated whenever a new character is added.

I'm looking for a list with just code point and name, in CSV or any other format:

code   character name
...
0102   LATIN CAPITAL LETTER A WITH BREVE
0103   LATIN SMALL LETTER A WITH BREVE
...

Upvotes: 6

Views: 1183

Answers (2)

dawg

Reputation: 103884

The semicolon-delimited file located at

https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

has one record per named code point, in a format that looks like this:

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
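
To make the layout of such a record concrete, here is a minimal sketch that splits the sample line into a few of its fields. The field positions follow the Unicode Character Database documentation (UAX #44); the helper name parse_record and the choice of which fields to keep are assumptions made for illustration only.

```ruby
# parse_record is a hypothetical helper name used only in this sketch.
def parse_record(line)
  # A limit of -1 keeps trailing empty fields, so each record has 15 fields.
  fields = line.chomp.split(';', -1)
  {
    code_point: fields[0].to_i(16), # field 0: code point in hex
    name: fields[1],                # field 1: character name
    category: fields[2],            # field 2: General_Category
    old_name: fields[10]            # field 10: Unicode 1.0 name (often empty)
  }
end

record = parse_record('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;')
puts format('U+%04X %s (%s)', record[:code_point], record[:name], record[:category])
```

Passing -1 to split matters here: without it Ruby drops trailing empty fields, so the field count would vary from line to line.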

If you want to parse the latest database of Unicode character names, here is a Ruby script to do that:

#!/usr/bin/env ruby

require 'net/http'

uri = URI('https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt')
txt = Net::HTTP.get(uri)
txt.split(/\R/).each do |line|
  fields = line.split(';')
  # Placeholders such as <control> are not real names, so append the
  # legacy Unicode 1.0 name (field 10) when the record has one.
  if fields[1].start_with?('<')
    puts "#{fields[0]} #{fields[1]} #{fields[10]}".rstrip
  else
    puts "#{fields[0]} #{fields[1]}"
  end
end

Or with curl and awk (using bash process substitution):

awk -F";" '
{   sub(/;*$/,""); $1=$1
    if ($2~"^<.*>$") 
        printf "%s %s %s\n", $1, $2, ($NF~"^N$") ? "" : $NF
    else
        printf "%s %s\n", $1, $2
}' <(curl -s "https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt") 

Either one prints:

0000 <control> NULL
0001 <control> START OF HEADING
0002 <control> START OF TEXT
...
0041 LATIN CAPITAL LETTER A
0042 LATIN CAPITAL LETTER B
0043 LATIN CAPITAL LETTER C
...
00C0 LATIN CAPITAL LETTER A WITH GRAVE
00C1 LATIN CAPITAL LETTER A WITH ACUTE
00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3 LATIN CAPITAL LETTER A WITH TILDE
...
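
One caveat about this output: UnicodeData.txt does not list every character individually. Large blocks such as the CJK ideographs appear as a pair of records whose names end in ", First>" and ", Last>", and the names inside such a range are derived by rule rather than listed. A minimal sketch of separating the two kinds of record, assuming the file's text is already in a string (name_table is a hypothetical helper, and the sample lines are only modeled on the file):

```ruby
# name_table is a hypothetical helper for this sketch: it splits the text of
# UnicodeData.txt into [code, name] pairs and [first, last, label] ranges.
def name_table(text)
  names = []
  ranges = []
  range_start = nil
  text.each_line do |line|
    code, name = line.chomp.split(';', -1)
    next if name.nil? # skip blank lines

    if name.end_with?(', First>')
      range_start = code
    elsif name.end_with?(', Last>')
      # The label is the text between "<" and ", Last>".
      ranges << [range_start, code, name[1...-7]]
      range_start = nil
    else
      names << [code, name]
    end
  end
  [names, ranges]
end

# Sample lines modeled on the file's format, for illustration only.
sample = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
  4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
DATA

names, ranges = name_table(sample)
names.each { |code, name| puts "#{code}\t#{name}" }
ranges.each { |first, last, label| puts "#{first}..#{last}\t#{label}" }
```

The output is tab-separated rather than comma-separated because the range labels themselves contain commas.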

Upvotes: 1

Joachim Sauer

Reputation: 308061

The official source for the actual character data (which includes the character names and many, many other details) is the Unicode Character Database.

The latest version of the data files can be accessed via http://www.unicode.org/Public/UCD/latest/.

Names specifically can be found in the file NamesList.txt. The format of that file is described here.
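
As a rough sketch of pulling just code points and names out of NamesList.txt: this assumes that a character entry is a line consisting of a hexadecimal code point, a tab, and the name, while block headers start with "@" and annotation lines start with a tab. The helper name names_from_nameslist and the sample text are hypothetical; check the format description before relying on this.

```ruby
# names_from_nameslist is a hypothetical helper for this sketch.
# Assumption: character entries look like "0041<TAB>LATIN CAPITAL LETTER A".
def names_from_nameslist(text)
  pairs = text.each_line.map do |line|
    m = line.chomp.match(/\A(\h{4,6})\t(.+)\z/)
    [m[1], m[2]] if m # header and annotation lines yield nil
  end
  pairs.compact
end

# Sample text modeled on the file's layout, for illustration only.
sample = "@@\t0000\tC0 Controls and Basic Latin (Basic Latin)\t007F\n" \
         "0041\tLATIN CAPITAL LETTER A\n" \
         "00C0\tLATIN CAPITAL LETTER A WITH GRAVE\n" \
         "\t: 0041 0300\n"

names_from_nameslist(sample).each { |code, name| puts "#{code}\t#{name}" }
```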

This is the list in a semicolon-delimited (CSV-like) format: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt

Upvotes: 10
