Lance Pollard

Reputation: 79420

Where to find all 137,929 named Unicode characters in 12.1 in downloadable format

I have downloaded the Unicode 12.1.0 data, and the file UnicodeData.txt has only 32,841 lines, so only ~30k characters. I haven't been able to find the other 105,088 characters. Are they somewhere in Unihan.zip, or somewhere in UCD.zip? I can't seem to find this information here.

Which files do I need to end up with a database of all the named characters?

Upvotes: 3

Views: 320

Answers (2)

Mark Tolonen

Reputation: 177971

@CraigBarnes is correct that UnicodeData.txt contains all the characters. Here's some proof (Python code):

import csv

D = {}  # code point -> [name, remaining UnicodeData.txt fields...]

with open('UnicodeData.txt', encoding='utf-8-sig') as f:
    r = csv.reader(f, delimiter=';')
    for line in r:
        # Expand the Ideograph and Hangul Syllable ranges. The generated names are
        # placeholders built from the range label, good enough for counting.
        if ('Ideograph' in line[1] or line[1].startswith('<Hangul')) and line[1].endswith('First>'):
            end = next(r)  # the matching "<..., Last>" line
            for i in range(int(line[0], 16), int(end[0], 16) + 1):
                D[i] = [line[1][1:-8].upper() + '-' + f'{i:04X}'] + line[2:]
        elif line[1][0] == '<':
            continue  # skip control characters, surrogates, and private use
        else:
            D[int(line[0], 16)] = line[1:]  # count everything else as one entry

print(len(D))

Output:

137929
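
The snippet above builds each range name from the range label (for example CJK IDEOGRAPH-4E00), which is fine for counting but is not the formal character name. If the formal names matter for your database, something along these lines might work. It is only a sketch: hangul_name and range_name are hypothetical helpers, the Jamo short-name tables are typed in by hand on the assumption that they match Jamo.txt in the UCD, and only the range labels present in 12.1 are considered.

```python
# Constants for the arithmetic decomposition of precomposed Hangul syllables
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

# Jamo short names (assumed to match Jamo.txt); "" marks an empty short name
L_NAMES = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
           "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
V_NAMES = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
           "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
T_NAMES = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
           "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
           "K", "T", "P", "H"]


def hangul_name(cp):
    """Derive the name of a precomposed Hangul syllable from its code point."""
    l, rest = divmod(cp - S_BASE, V_COUNT * T_COUNT)
    v, t = divmod(rest, T_COUNT)
    return "HANGUL SYLLABLE " + L_NAMES[l] + V_NAMES[v] + T_NAMES[t]


def range_name(label, cp):
    """label is field 1 of a 'First' line without '<' and ', First>'."""
    if label == "Hangul Syllable":
        return hangul_name(cp)
    if label.startswith("CJK Ideograph"):  # includes the Extension A-F ranges
        return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"
    return f"{label.upper()}-{cp:04X}"  # e.g. Tangut Ideograph


print(range_name("Hangul Syllable", 0xAC00))  # HANGUL SYLLABLE GA
print(range_name("CJK Ideograph", 0x4E00))    # CJK UNIFIED IDEOGRAPH-4E00
```

In the loop above, range_name(line[1][1:-8], i) could stand in for the string concatenation; the count stays the same either way.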

Upvotes: 4

Craig Barnes

Reputation: 147

Some of the entries in UnicodeData.txt are character ranges, as described in the technical report:

For backward compatibility, ranges in the file UnicodeData.txt are specified by entries for the start and end characters of the range, rather than by the form "X..Y". The start character is indicated by a range identifier, followed by a comma and the string "First", in angle brackets. This entry takes the place of a regular character name in field 1 for that line. The end character is indicated on the next line with the same range identifier, followed by a comma and the string "Last", in angle brackets:

4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;

In other words, the number of lines in the UnicodeData.txt file isn't the same as the number of characters in the database. Some of the character ranges cover hundreds or thousands of characters but are encoded in only two lines.
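
To make the arithmetic concrete, here is a small sketch that takes the two example lines quoted above and works out how many code points they stand for. The function name range_size is just something I made up for illustration.

```python
first = "4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;"
last = "9FEF;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;"


def range_size(first_line, last_line):
    """Number of code points covered by a First/Last pair (inclusive)."""
    start = int(first_line.split(";")[0], 16)  # field 0 is the hex code point
    end = int(last_line.split(";")[0], 16)
    return end - start + 1


print(range_size(first, last))  # 20976
```

So this one pair of lines accounts for 20,976 characters on its own.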

Upvotes: 7
