Reputation: 59
As per the title, I'd like to create a list of Unicode characters in Python without having to type or copy-paste every character I want.
For context, I want to delete all the Japanese characters from a string. (This includes kanji, hiragana, and katakana.) This string changes, so the specific Japanese characters inside of it will change, too. Currently, I am trying to do this by creating a list of all Japanese characters and then a for loop that replaces those characters with an empty string.
My code goes as follows, with str_compounds as the string and banned_ja_char as the forthcoming list of Japanese characters:
# unpack banned_ja_char so each character is its own entry (str.replace needs strings)
sub_list = [*banned_ja_char, "【", "】", ",", "(", ")", "e.g.", "etc."]
# removing substring list from string
for sub in sub_list:
    str_compounds = str_compounds.replace(sub, '')
What methods could I use to remove all Japanese characters in the str_compounds string? Is it possible to compile a list of all Japanese characters (in Unicode or otherwise) without just having a huge list of every individual Unicode character? Alternatively, is there something similar to the range() function that I could use in this situation?
Upvotes: 1
Views: 1018
Reputation: 6295
You can remove Japanese characters with regular expressions that match whole Unicode ranges:
import re
s = """The Japanese writing system consists of two types of characters:
the syllabic kana – hiragana (平仮名) and katakana (片仮名) – and kanji (漢字)"""
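# Each character class covers one Unicode block
hiragana = re.compile('[\u3040-\u309F]')  # Hiragana: U+3040-U+309F
katakana = re.compile('[\u30A0-\u30FF]')  # Katakana: U+30A0-U+30FF
CJK = re.compile('[\u4E00-\u9FFF]')       # CJK Unified Ideographs (kanji): U+4E00-U+9FFF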
s = hiragana.sub('', s)
s = katakana.sub('', s)
s = CJK.sub('', s)
Result:
The Japanese writing system consists of two types of characters:
the syllabic kana – hiragana () and katakana () – and kanji ()
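The three substitutions can also be folded into one pass, and the same ranges can answer the question about a range()-like approach. The following is a minimal sketch, not a definitive implementation: the names JA_RANGES, strip_japanese_regex, and strip_japanese_translate are made up for illustration, and it assumes that only the Hiragana, Katakana, and CJK Unified Ideographs blocks matter. Half-width katakana and the rarer CJK extension blocks are not covered and would need extra ranges.

```python
import re

# Assumed ranges (inclusive code points); extend the list if you need more blocks.
JA_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs (kanji)
]

# Option 1: a single character class built from all ranges.
JA_PATTERN = re.compile(
    "[" + "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in JA_RANGES) + "]"
)

def strip_japanese_regex(text):
    """Remove every character that falls in one of JA_RANGES."""
    return JA_PATTERN.sub("", text)

# Option 2: range() over code points, mapped to None for str.translate.
JA_TABLE = {cp: None for lo, hi in JA_RANGES for cp in range(lo, hi + 1)}

def strip_japanese_translate(text):
    """Same result as the regex version, using a translation table."""
    return text.translate(JA_TABLE)

sample = "kanji (漢字), hiragana (ひらがな), katakana (カタカナ)"
print(strip_japanese_regex(sample))      # kanji (), hiragana (), katakana ()
print(strip_japanese_translate(sample))  # kanji (), hiragana (), katakana ()
```

If an actual list of characters is needed, as in the question, [chr(cp) for lo, hi in JA_RANGES for cp in range(lo, hi + 1)] produces one, although the regex or the translation table avoids looping over thousands of replace() calls.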
Upvotes: 1