How can I remove Japanese characters from a string?

Question

As per the title, I'd like to build a list of Unicode characters in Python without having to type or copy-paste every character I want.

For context, I want to delete all the Japanese characters from a string. (This includes kanji, hiragana, and katakana.) This string changes, so the specific Japanese characters inside of it will change, too. Currently, I am trying to do this by creating a list of all Japanese characters and then a for loop that replaces those characters with an empty string.

My code goes as follows, with str_compounds as the string and banned_ja_char as the forthcoming list of Japanese characters:

sub_list = [banned_ja_char, "【", "】", ",", "(", ")", "e.g.", "etc."]

# removing substring list from string
for sub in sub_list:
    str_compounds = str_compounds.replace(sub, '')

What methods could I use to remove all Japanese characters in the str_compounds string? Is it possible to compile a list of all Japanese characters (in unicode or otherwise) without just having a huge list of every individual unicode character? Alternatively, is there something similar to the range() function that I could use in this situation?

Tom McLean · Accepted Answer

You can remove Japanese characters with regular expressions that match their Unicode ranges:

import re

s = """The Japanese writing system consists of two types of characters:
the syllabic kana – hiragana (平仮名) and katakana (片仮名) – and kanji (漢字)"""

# Unicode block ranges for each script
hiragana = re.compile('[\u3040-\u309F]')
katakana = re.compile('[\u30A0-\u30FF]')
# CJK Unified Ideographs (kanji) occupy U+4E00 to U+9FFF
CJK = re.compile('[\u4E00-\u9FFF]')

s = hiragana.sub('', s)
s = katakana.sub('', s)
s = CJK.sub('', s)

Result:

The Japanese writing system consists of two types of characters:
the syllabic kana – hiragana () and katakana () – and kanji ()
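The three substitutions can also be folded into a single character class, which acts much like the range() the question asks about. The following is only a sketch: the names JAPANESE_CHARS, EXTRA_PATTERN, and strip_japanese are hypothetical, the kanji range is assumed to be the CJK Unified Ideographs block (U+4E00 to U+9FFF), and rarer characters outside these three blocks (for example the extension blocks or half-width katakana) are not covered. The extra substrings are taken from the sub_list in the question.

```python
import re

# One character class covering three Unicode blocks.
# Assumption: these blocks are enough for typical Japanese text.
JAPANESE_CHARS = re.compile(
    "["
    "\u3040-\u309f"  # Hiragana
    "\u30a0-\u30ff"  # Katakana
    "\u4e00-\u9fff"  # CJK Unified Ideographs (kanji)
    "]"
)

# Literal substrings from the question's sub_list, escaped so that
# characters such as "(" and "." are matched literally.
EXTRA_SUBSTRINGS = ["【", "】", ",", "(", ")", "e.g.", "etc."]
EXTRA_PATTERN = re.compile("|".join(re.escape(sub) for sub in EXTRA_SUBSTRINGS))


def strip_japanese(text, strip_extras=False):
    """Remove hiragana, katakana, and kanji; optionally remove the extras too."""
    text = JAPANESE_CHARS.sub("", text)
    if strip_extras:
        text = EXTRA_PATTERN.sub("", text)
    return text


print(strip_japanese("hiragana (平仮名) and kanji (漢字)"))
print(strip_japanese("【漢字】 kanji, e.g. かな", strip_extras=True))
```

Compiling the patterns once and reusing them avoids looping over individual characters, and re.escape keeps the punctuation in the extra substrings from being read as regex syntax.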
