Reputation: 46503
There are a few ways to get the list of all Unicode characters' names: for example, using the Python module unicodedata, as explained in List of unicode character names, or using the website https://unicode.org/charts/charindex.html, but that index is incomplete, and you have to open and parse PDFs to find the names.
But what is the official source / repository of all Unicode character names? (such that if a new character is added, the list is updated, so I'm looking for the initial source for these names, in a machine readable format).
I'm looking for a list with just code point and name, in CSV or any other format:
code character name
...
0102 LATIN CAPITAL LETTER A WITH BREVE
0103 LATIN SMALL LETTER A WITH BREVE
...
Upvotes: 6
Views: 1183
Reputation: 103884
The semicolon-separated file located at
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
has data for each named code point in a format that looks like this:
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
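To make the record layout above easier to read, here is a minimal Ruby sketch that splits one such line into its 15 semicolon-separated fields. The field order follows the description of UnicodeData.txt in UAX #44; the symbol names in FIELDS and the helper parse_record are hypothetical names chosen for this illustration, not part of any library.

```ruby
# Field order as documented for UnicodeData.txt in UAX #44.
# The symbol spellings are this sketch's own choice (an assumption).
FIELDS = %i[
  code_point name general_category canonical_combining_class bidi_class
  decomposition decimal_value digit_value numeric_value bidi_mirrored
  unicode_1_name iso_comment uppercase lowercase titlecase
].freeze

# Hypothetical helper: turn one record line into a Hash keyed by field name.
def parse_record(line)
  # The -1 limit keeps trailing empty fields, so every record has 15 values.
  values = line.chomp.split(';', -1)
  FIELDS.zip(values).to_h
end

record = parse_record('0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;')
puts "#{record[:code_point]} #{record[:name]}"
```

For the code point and name alone, only the first two fields are needed; the rest can be ignored.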
If you want to parse the latest database of Unicode character names, here is a Ruby script to do that:
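#!/usr/bin/env ruby
require 'net/http'

uri = URI('https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt')
txt = Net::HTTP.get(uri)
txt.split(/\R/).each do |line|
  # The -1 limit keeps trailing empty fields so the indexes stay fixed.
  fields = line.split(';', -1)
  if fields[1][/<[^>]*>/]
    # Placeholder names such as <control>: append the old Unicode 1.0
    # name (field 10), which is empty for range markers.
    puts "#{fields[0]} #{fields[1]} #{fields[10]}"
  else
    puts "#{fields[0]} #{fields[1]}"
  end
end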
Or a curl and awk pipe:
awk -F";" '
{ sub(/;*$/,""); $1=$1
  if ($2~"^<.*>$")
    printf "%s %s %s\n", $1, $2, ($NF~"^N$") ? "" : $NF
  else
    printf "%s %s\n", $1, $2
}' <(curl -s "https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt")
Prints:
0000 <control> NULL
0001 <control> START OF HEADING
0002 <control> START OF TEXT
...
0041 LATIN CAPITAL LETTER A
0042 LATIN CAPITAL LETTER B
0043 LATIN CAPITAL LETTER C
...
00C0 LATIN CAPITAL LETTER A WITH GRAVE
00C1 LATIN CAPITAL LETTER A WITH ACUTE
00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX
00C3 LATIN CAPITAL LETTER A WITH TILDE
...
Upvotes: 1
Reputation: 308061
The official source for the actual character data (which includes the character names and many, many other details) is the Unicode Character Database.
The latest version of the data files can be accessed via http://www.unicode.org/Public/UCD/latest/.
Names specifically can be found in the file NamesList.txt. The format of that file is described here.
This is the list in a semicolon-separated format: https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
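As a rough illustration of reading NamesList.txt, the Ruby sketch below keeps only the lines that start with a hexadecimal code point followed by a tab and a name, and skips everything else (block headers starting with @, and annotation lines starting with a tab). This assumes the name-line layout given in the file's format description; the helper names_from_nameslist and the sample text are hypothetical, and placeholder entries such as <control> are passed through unchanged.

```ruby
# Assumed shape of a name line: 4 to 6 hex digits, a tab, then the name.
NAME_LINE = /\A(\h{4,6})\t(.+)\z/

# Hypothetical helper: return [code_point, name] pairs found in the text.
def names_from_nameslist(text)
  text.each_line(chomp: true).filter_map do |line|
    match = NAME_LINE.match(line)
    [match[1], match[2]] if match
  end
end

# A small hand-written sample in the NamesList.txt style (not real file output).
sample = [
  "@@\t0000\tC0 Controls and Basic Latin (Basic Latin)\t007F",
  "0000\t<control>",
  "\t= NULL",
  "0041\tLATIN CAPITAL LETTER A",
  "\tx (greek capital letter alpha - 0391)"
].join("\n")

names_from_nameslist(sample).each { |code, name| puts "#{code} #{name}" }
```

If only code points and names are needed, UnicodeData.txt is usually simpler to parse, since NamesList.txt mixes names with annotations and cross-references.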
Upvotes: 10