Where to find all 137,929 named Unicode characters in 12.1 in downloadable format

Question

I have downloaded the Unicode 12.1.0 data, and the file UnicodeData.txt has only 32,841 lines, so only ~30k characters. I haven't been able to find the other 105,088 characters. Are they somewhere in Unihan.zip, or somewhere in UCD.zip? I can't seem to find this information here.

Which files do I need to build a database of all the named characters?

Mark Tolonen · Accepted Answer

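@CraigBarnes is correct that UnicodeData.txt contains all the characters. The large CJK ideograph and Hangul syllable blocks are each listed as a pair of <..., First> and <..., Last> lines instead of one line per character, which is why the file has so few lines. Here's some proof (Python code):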

import csv

D = {}

with open('UnicodeData.txt',encoding='utf-8-sig') as f:
    r = csv.reader(f,delimiter=';')
    for line in r:
        # Ideograph and Hangul Syllable ranges appear as a <..., First> /
        # <..., Last> pair; expand each pair into one entry per code point
        # and generate a placeholder name for it
        if ('Ideograph' in line[1] or line[1].startswith('<Hangul')) and line[1].endswith('First>'):
            end = next(r) # the matching <..., Last> line
            for i in range(int(line[0],16),int(end[0],16)+1):
                D[i] = [line[1][1:-8].upper() + '-' + f'{i:04X}'] + line[2:]
        elif line[1][0] == '<':
            continue # skip control characters, surrogates, and private use
        else:
            D[int(line[0],16)] = line[1:] # count everything else as one entry

print(len(D))

Output:

137929
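To see where the extra code points come from, here is a minimal sketch that lists each First/Last pair with its size. It is not part of the code above: `summarize_ranges` and `sample` are hypothetical names, and the sample lines are a small hand-copied excerpt in UnicodeData.txt's format so the snippet runs without the file.

```python
def summarize_ranges(lines):
    """Map each range label to (first, last, count) for First/Last pairs."""
    ranges = {}
    pending = None  # (label, first code point) waiting for its Last line
    for line in lines:
        fields = line.rstrip('\n').split(';')
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(', First>'):
            # Strip the leading '<' and the trailing ', First>'
            pending = (name[1:-len(', First>')], code)
        elif name.endswith(', Last>') and pending is not None:
            label, first = pending
            ranges[label] = (first, code, code - first + 1)
            pending = None
    return ranges

# Hand-copied sample lines (assumption: same field layout as the real file)
sample = [
    '0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;',
    'AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;',
    'D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;',
    'E000;<Private Use, First>;Co;0;L;;;;;N;;;;;',
    'F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;',
]

for label, (first, last, count) in summarize_ranges(sample).items():
    print(f'{label}: U+{first:04X}..U+{last:04X} ({count} code points)')
```

With the real file, passing the open file object instead of `sample` should list every range, including the private use and surrogate ranges that the counting code skips.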
