Reputation: 21
I'm trying to generate all Unicode 16.0 characters in one file and all Unicode 15.1 characters in another file, then write the characters added in Unicode 16.0 to a third file.
I tried the code below, but it isn't what I'm looking for: there may be new emoji or other characters that aren't printable under Unicode 15.1 but are printable under Unicode 16.0, and I don't think I've generated the characters correctly. Please take a look at the source code, thank you.
import os

file_15_1 = "unicode_15_1.txt"
file_16_0 = "unicode_16_0.txt"
file_new_in_16_0 = "new_in_16_0.txt"

unicode_15_1_end = 149813
unicode_16_0_end = 154998

def is_visible(char):
    return char.isprintable() and not char.isspace() and char != ""

def generate_unicode_file(start, end, filename):
    with open(filename, "w", encoding="utf-8") as f:
        for codepoint in range(start, end + 1):
            try:
                f.write(chr(codepoint) + "\n")
            except ValueError:
                continue

generate_unicode_file(0, unicode_15_1_end, file_15_1)
generate_unicode_file(0, unicode_16_0_end, file_16_0)

def find_new_characters(file1, file2, output_file):
    with open(file1, "r", encoding="utf-8") as f1, open(file2, "r", encoding="utf-8") as f2:
        chars_15_1 = set(f1.read().splitlines())
        chars_16_0 = set(f2.read().splitlines())
    new_in_16_0 = chars_16_0 - chars_15_1  # No
    with open(output_file, "w", encoding="utf-8") as f_out:
        for char in sorted(new_in_16_0):
            if is_visible(char):
                f_out.write(char + "\n")

find_new_characters(file_15_1, file_16_0, file_new_in_16_0)

print(f"- {file_15_1}")
print(f"- {file_16_0}")
print(f"- {file_new_in_16_0}")
Upvotes: 1
Views: 89
Reputation: 1447
Assuming you have the latest version of icu4c available, PyICU will provide a simple way to get a set of the required characters:
import icu
uset = icu.UnicodeSet(r'[[[\p{graph}]-[\p{cntrl}]]-[\p{Age=15.1}]]')
u16_chars = list(uset)
print(len(u16_chars))
# 5185
The set [[\p{graph}]-[\p{cntrl}]] is the set of printable characters excluding blanks, i.e. excluding the subset of whitespace characters that are included in the definition of a printable character. Subtracting [\p{Age=15.1}] then removes every character whose Age property value is less than or equal to 15.1, leaving only the characters new in Unicode 16.0.
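Without ICU available, a rough stdlib approximation of that visibility filter (my own sketch, only as accurate as the interpreter's own Unicode tables, which lag behind 16.0) looks like:

```python
import unicodedata

def is_graphic(ch):
    # Rough stand-in for ICU's [[\p{graph}]-[\p{cntrl}]]: reject whitespace,
    # controls, surrogates, private-use and unassigned code points.
    return not ch.isspace() and unicodedata.category(ch) not in {
        "Cc",  # control
        "Cs",  # surrogate
        "Co",  # private use
        "Cn",  # unassigned
        "Zs", "Zl", "Zp",  # separators
    }

print(is_graphic("A"), is_graphic(" "), is_graphic("\x07"))
# True False False
```

Note that the stdlib has no Age property, so this can only approximate the visibility filter; it can't isolate the characters new in 16.0 the way the ICU set subtraction does.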
For character details, use the latest version of unicodedataplus, a drop-in replacement for unicodedata that has additional methods and supports Unicode 16.
import unicodedataplus as ud

u16_subset = u16_chars[0:10]
for char in u16_subset:
    print(
        f'{ord(char):04X}',
        ud.name(char),
        ud.script(char),
        ud.block(char),
        sep='\t')
# 0897 ARABIC PEPET Arabic Arabic Extended-B
# 1B4E BALINESE INVERTED CARIK SIKI Balinese Balinese
# 1B4F BALINESE INVERTED CARIK PAREREN Balinese Balinese
# 1B7F BALINESE PANTI BAWAK Balinese Balinese
# 1C89 CYRILLIC CAPITAL LETTER TJE Cyrillic Cyrillic Extended-C
# 1C8A CYRILLIC SMALL LETTER TJE Cyrillic Cyrillic Extended-C
# 2427 SYMBOL FOR DELETE SQUARE CHECKER BOARD FORM Common Control Pictures
# 2428 SYMBOL FOR DELETE RECTANGULAR CHECKER BOARD FORM Common Control Pictures
# 2429 SYMBOL FOR DELETE MEDIUM SHADE FORM Common Control Pictures
# 31E4 CJK STROKE HXG Common CJK Strokes
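If unicodedataplus isn't installed, the stdlib unicodedata module exposes the same name() call (though without script() and block(), and only for characters up to the interpreter's own Unicode version):

```python
import unicodedata

# Works for any character the interpreter's Unicode tables know about
for char in "Ωé":
    print(f"{ord(char):04X}",
          unicodedata.name(char),
          unicodedata.category(char),
          sep="\t")
# 03A9	GREEK CAPITAL LETTER OMEGA	Lu
# 00E9	LATIN SMALL LETTER E WITH ACUTE	Ll
```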
Upvotes: 1
Reputation: 178021
Although the current Python 3.13 supports Unicode 15.1.0 in its unicodedata module and can identify the supported code points, that won't help you with Unicode 16.0.0. If you download the UnicodeData.txt files for each version (15.1.0, 16.0.0) you can parse them yourself for the supported characters and write them to a file; although, without a font supporting Unicode 16.0.0 you won't see much. UnicodeData.html describes the data format.
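For illustration, each UnicodeData.txt record is one semicolon-delimited line, with the hex code point in field 0 and the character name in field 1; a minimal sketch parsing a single inlined record (the real line for U+0041) looks like:

```python
import csv
import io

# One record from UnicodeData.txt, inlined here for illustration
sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"

row = next(csv.reader(io.StringIO(sample), delimiter=";"))
code, name, category = row[0], row[1], row[2]
print(f"{chr(int(code, 16))} U+{code} {name} ({category})")
# A U+0041 LATIN CAPITAL LETTER A (Lu)
```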
Here's an example that uses the csv module to parse the semicolon-delimited data files.
import csv

def print_file(filename, data):
    with open(filename, 'w', encoding='utf-8-sig') as file:
        for key, value in data.items():
            code = int(key, 16)
            if 0xD800 <= code <= 0xDFFF:
                continue  # ignore surrogates...can't be written individually
            name = value[0]
            print(f'{chr(code)} U+{key} {name}', file=file)

with open('Downloads/UnicodeData15.1.0.txt', encoding='ascii', newline='') as file:
    reader = csv.reader(file, delimiter=';')
    data15 = list(reader)

with open('Downloads/UnicodeData16.0.0.txt', encoding='ascii', newline='') as file:
    reader = csv.reader(file, delimiter=';')
    data16 = list(reader)

dict15 = {row[0]: row[1:] for row in data15}
dict16 = {row[0]: row[1:] for row in data16}
diff = {key: value for key, value in dict16.items() if key not in dict15}

print(f'Code points in Unicode 15.1.0: {len(dict15)}')
print(f'Code points in Unicode 16.0.0: {len(dict16)}')
print(f'New code points in Unicode 16.0.0: {len(diff)}')

print_file('unicode_15_1.txt', dict15)
print_file('unicode_16_0.txt', dict16)
print_file('new_in_16_0.txt', diff)
Output (along with three files)
Code points in Unicode 15.1.0: 34931
Code points in Unicode 16.0.0: 40116
New code points in Unicode 16.0.0: 5185
Example of new_in_16_0.txt. Glyph display depends on font support; I could see the last five correctly on Windows 11 and Chrome Version 131.0.6778.265 (Official Build) (64-bit):
U+0897 ARABIC PEPET
U+1B4E BALINESE INVERTED CARIK SIKI
U+1B4F BALINESE INVERTED CARIK PAREREN
U+1B7F BALINESE PANTI BAWAK
U+1C89 CYRILLIC CAPITAL LETTER TJE
...
U+1FBEB LEFT JUSTIFIED RIGHT HALF BLACK CIRCLE
U+1FBEC TOP RIGHT JUSTIFIED LOWER LEFT QUARTER BLACK CIRCLE
U+1FBED BOTTOM LEFT JUSTIFIED UPPER RIGHT QUARTER BLACK CIRCLE
U+1FBEE BOTTOM RIGHT JUSTIFIED UPPER LEFT QUARTER BLACK CIRCLE
U+1FBEF TOP LEFT JUSTIFIED LOWER RIGHT QUARTER BLACK CIRCLE
Upvotes: 1