Reputation: 79420
I downloaded the Unicode 12.1.0 data, and the file UnicodeData.txt has only 32,841 lines, so only ~30k characters. I haven't been able to find where the other 105,088 characters are. Are they somewhere in Unihan.zip, or somewhere in UCD.zip? I can't seem to find this information here.
Which files do I need in order to end up with a database of all the named characters?
Upvotes: 3
Views: 320
Reputation: 177971
@CraigBarnes is correct that UnicodeData.txt contains all the characters. Here's some proof (Python code):
import csv

D = {}
with open('UnicodeData.txt', encoding='utf-8-sig') as f:
    r = csv.reader(f, delimiter=';')
    for line in r:
        # Expand the Ideograph and Hangul Syllable ranges, one entry per code point.
        # The names built here are placeholders for counting, not the official derived names.
        if ('Ideograph' in line[1] or line[1].startswith('<Hangul')) and line[1].endswith('First>'):
            end = next(r)  # the matching "<..., Last>" line
            for i in range(int(line[0], 16), int(end[0], 16) + 1):
                D[i] = [line[1][1:-8].upper() + '-' + f'{i:04X}'] + line[2:]
        elif line[1][0] == '<':
            continue  # skip control, surrogate and private-use entries
        else:
            D[int(line[0], 16)] = line[1:]  # count everything else as one entry
print(len(D))
Output:
137929
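The snippet above builds each range name by upper-casing the range label, which is enough for counting but does not match the names the Unicode Standard derives for these characters. As a rough sketch of those derived names (the function name derived_name and its label argument are my own, hypothetical choices; the jamo short-name tables follow the Hangul syllable name algorithm in the standard):

```python
# Jamo short names used by the Hangul syllable name algorithm.
LEADS = ['G', 'GG', 'N', 'D', 'DD', 'R', 'M', 'B', 'BB', 'S',
         'SS', '', 'J', 'JJ', 'C', 'K', 'T', 'P', 'H']
VOWELS = ['A', 'AE', 'YA', 'YAE', 'EO', 'E', 'YEO', 'YE', 'O', 'WA', 'WAE',
          'OE', 'YO', 'U', 'WEO', 'WE', 'WI', 'YU', 'EU', 'YI', 'I']
TRAILS = ['', 'G', 'GG', 'GS', 'N', 'NJ', 'NH', 'D', 'L', 'LG', 'LM', 'LB',
          'LS', 'LT', 'LP', 'LH', 'M', 'B', 'BS', 'S', 'SS', 'NG', 'J', 'C',
          'K', 'T', 'P', 'H']


def derived_name(cp, label):
    """Return the derived character name for code point `cp`.

    `label` is the range identifier from UnicodeData.txt with the angle
    brackets and the ", First" suffix removed, e.g. "CJK Ideograph".
    (Hypothetical helper, not part of any library.)
    """
    if label == 'Hangul Syllable':
        # Split the syllable index into lead, vowel and trailing jamo indices.
        lead, rest = divmod(cp - 0xAC00, len(VOWELS) * len(TRAILS))
        vowel, trail = divmod(rest, len(TRAILS))
        return 'HANGUL SYLLABLE ' + LEADS[lead] + VOWELS[vowel] + TRAILS[trail]
    if label.startswith('CJK Ideograph'):
        # The base range and all extensions share one name prefix.
        return f'CJK UNIFIED IDEOGRAPH-{cp:04X}'
    if label == 'Tangut Ideograph':
        return f'TANGUT IDEOGRAPH-{cp:04X}'
    raise ValueError(f'no derived-name rule for range {label!r}')


print(derived_name(0xD55C, 'Hangul Syllable'))  # HANGUL SYLLABLE HAN
print(derived_name(0x3400, 'CJK Ideograph Extension A'))
```

For Hangul and CJK code points the result can be cross-checked against unicodedata.name() from the Python standard library, keeping in mind that the module tracks whatever Unicode version your Python build ships with.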
Upvotes: 4
Reputation: 147
Some of the entries in UnicodeData.txt are character ranges, as described in the technical report:
For backward compatibility, ranges in the file UnicodeData.txt are specified by entries for the start and end characters of the range, rather than by the form "X..Y". The start character is indicated by a range identifier, followed by a comma and the string "First", in angle brackets. This entry takes the place of a regular character name in field 1 for that line. The end character is indicated on the next line with the same range identifier, followed by a comma and the string "Last", in angle brackets:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
In other words, the number of lines in UnicodeData.txt isn't the same as the number of characters in the database. Some of the character ranges consist of thousands of characters encoded in only two lines; the CJK Ideograph range shown above (4E00..9FEF) alone covers 20,976 code points.
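To make the First/Last convention concrete, here is a minimal sketch that pairs up range lines and counts the code points they cover. It runs on a small inline sample rather than the real file; the SAMPLE string and the iter_ranges helper are my own illustrative names, and the first sample line is an ordinary single-character entry added for contrast.

```python
import csv
import io

# Three lines in UnicodeData.txt format: one ordinary entry, then the
# First/Last pair quoted above.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
"""


def iter_ranges(lines):
    """Yield (first, last, label) per entry; single characters have first == last."""
    reader = csv.reader(lines, delimiter=';')
    for row in reader:
        code, name = int(row[0], 16), row[1]
        if name.endswith(', First>'):
            # The matching "<..., Last>" line always follows immediately.
            last_row = next(reader)
            yield code, int(last_row[0], 16), name[1:-len(', First>')]
        else:
            yield code, code, name


ranges = list(iter_ranges(io.StringIO(SAMPLE)))
total = sum(last - first + 1 for first, last, _ in ranges)
print(len(ranges), 'entries covering', total, 'code points')
```

Pointing iter_ranges at an open UnicodeData.txt file handle instead of the sample should give the per-range sizes for the whole database, although you would still need to decide which bracketed entries (controls, surrogates, private use) to count.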
Upvotes: 7