Reputation: 439
Such as:
str = 'sdf344asfasf天地方益3権sdfsdf'
Add () around the Chinese and Japanese characters:
strAfterConvert = 'sdfasfasf(天地方益)3(権)sdfsdf'
Upvotes: 25
Views: 28762
Reputation: 477
This basic approach worked for me in Python 3, just to find CJK or other characters:

import unicodedata

def has_unicode_group(text):
    for char in text:
        for name in ('CJK', 'CHINESE', 'KATAKANA', 'HANGUL'):
            # pass a default: name() raises ValueError for unnamed code points
            if name in unicodedata.name(char, ''):
                return True
    return False
Names can be found here: https://www.unicode.org/Public/15.0.0/ucd/UnicodeData.txt
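As a quick illustration (a sketch; the names come straight from the Unicode Character Database linked above):

```python
import unicodedata

# Every named code point has an official Unicode name; block/script keywords
# such as CJK, KATAKANA or HANGUL appear inside those names.
print(unicodedata.name('天'))  # CJK UNIFIED IDEOGRAPH-5929
print(unicodedata.name('カ'))  # KATAKANA LETTER KA

# name() raises ValueError for unnamed code points (e.g. control characters),
# so pass a default when scanning arbitrary text:
print(unicodedata.name('\x00', ''))  # prints an empty line
```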
Upvotes: 0
Reputation: 4812
As a start, you can check if the character is in one of the CJK-related Unicode blocks (listed in the ranges below). After that, all you need to do is iterate through the string, checking if the char is Chinese, Japanese or Korean (CJK), and append accordingly:
# -*- coding: utf-8 -*-
ranges = [
    {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},          # CJK Compatibility
    {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},          # CJK Compatibility Forms
    {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},          # CJK Compatibility Ideographs
    {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")},  # CJK Compatibility Ideographs Supplement
    {"from": ord(u"\u3040"), "to": ord(u"\u309f")},          # Japanese Hiragana
    {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},          # Japanese Katakana
    {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},          # CJK Radicals Supplement
    {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},          # CJK Unified Ideographs
    {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},          # CJK Unified Ideographs Extension A
    {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")},  # CJK Unified Ideographs Extension B
    {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")},  # CJK Unified Ideographs Extension C
    {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")},  # CJK Unified Ideographs Extension D
    {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")},  # CJK Unified Ideographs Extension E, included as of Unicode 8.0
]

def is_cjk(char):
    # "r" rather than "range", so the builtin isn't shadowed
    return any(r["from"] <= ord(char) <= r["to"] for r in ranges)

def cjk_substrings(string):
    i = 0
    while i < len(string):
        if is_cjk(string[i]):
            start = i
            # bounds check guards against a trailing CJK run
            while i < len(string) and is_cjk(string[i]):
                i += 1
            yield string[start:i]
        i += 1

string = u"sdf344asfasf天地方益3権sdfsdf"
for sub in cjk_substrings(string):
    string = string.replace(sub, "(" + sub + ")")
print(string)
The above prints
sdf344asfasf(天地方益)3(権)sdfsdf
To be future-proof, you might want to keep a lookout for CJK Unified Ideographs Extension E. It will ship with Unicode 8.0, which is scheduled for release in June 2015. I've added it to the ranges, but you shouldn't include it until Unicode 8.0 is released.
[EDIT]
Added CJK compatibility ideographs, Japanese Kana and CJK radicals.
Upvotes: 31
Reputation: 122142
From one of the bleeding-edge branches of NLTK, inspired by the Moses Machine Translation Toolkit:
def is_cjk(character):
    """
    Checks whether character is CJK.

        >>> is_cjk(u'\u33fe')
        True
        >>> is_cjk(u'\uFE5F')
        False

    :param character: The character that needs to be checked.
    :type character: char
    :return: bool
    """
    return any([start <= ord(character) <= end for start, end in
                [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215),
                 (63744, 64255), (65072, 65103), (65381, 65500),
                 (131072, 196607)]
                ])
For the specifics of the ord() numbers:
class CJKChars(object):
    """
    An object that enumerates the code points of the CJK characters as listed on
    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

    This is a Python port of the CJK code point enumerations of the Moses tokenizer:
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309
    """
    # Hangul Jamo (1100–11FF)
    Hangul_Jamo = (4352, 4607)  # (ord(u"\u1100"), ord(u"\u11ff"))

    # CJK Radicals Supplement (2E80–2EFF)
    # Kangxi Radicals (2F00–2FDF)
    # Ideographic Description Characters (2FF0–2FFF)
    # CJK Symbols and Punctuation (3000–303F)
    # Hiragana (3040–309F)
    # Katakana (30A0–30FF)
    # Bopomofo (3100–312F)
    # Hangul Compatibility Jamo (3130–318F)
    # Kanbun (3190–319F)
    # Bopomofo Extended (31A0–31BF)
    # CJK Strokes (31C0–31EF)
    # Katakana Phonetic Extensions (31F0–31FF)
    # Enclosed CJK Letters and Months (3200–32FF)
    # CJK Compatibility (3300–33FF)
    # CJK Unified Ideographs Extension A (3400–4DBF)
    # Yijing Hexagram Symbols (4DC0–4DFF)
    # CJK Unified Ideographs (4E00–9FFF)
    # Yi Syllables (A000–A48F)
    # Yi Radicals (A490–A4CF)
    CJK_Radicals = (11904, 42191)  # (ord(u"\u2e80"), ord(u"\ua4cf"))

    # Phags-pa (A840–A87F)
    Phags_Pa = (43072, 43135)  # (ord(u"\ua840"), ord(u"\ua87f"))

    # Hangul Syllables (AC00–D7AF)
    Hangul_Syllables = (44032, 55215)  # (ord(u"\uAC00"), ord(u"\uD7AF"))

    # CJK Compatibility Ideographs (F900–FAFF)
    CJK_Compatibility_Ideographs = (63744, 64255)  # (ord(u"\uF900"), ord(u"\uFAFF"))

    # CJK Compatibility Forms (FE30–FE4F)
    CJK_Compatibility_Forms = (65072, 65103)  # (ord(u"\uFE30"), ord(u"\uFE4F"))

    # Range U+FF65–FFDC encodes halfwidth forms of Katakana and Hangul characters
    Katakana_Hangul_Halfwidth = (65381, 65500)  # (ord(u"\uFF65"), ord(u"\uFFDC"))

    # Supplementary Ideographic Plane (20000–2FFFF)
    Supplementary_Ideographic_Plane = (131072, 196607)  # (ord(u"\U00020000"), ord(u"\U0002FFFF"))

    ranges = [Hangul_Jamo, CJK_Radicals, Phags_Pa, Hangul_Syllables,
              CJK_Compatibility_Ideographs, CJK_Compatibility_Forms,
              Katakana_Hangul_Halfwidth, Supplementary_Ideographic_Plane]
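The hard-coded integer pairs can be sanity-checked against the ord() values quoted in the comments:

```python
# Each tuple should equal the (ord(start), ord(end)) pair from the comments.
assert (ord('\u1100'), ord('\u11ff')) == (4352, 4607)              # Hangul Jamo
assert (ord('\u2e80'), ord('\ua4cf')) == (11904, 42191)            # CJK Radicals .. Yi Radicals
assert (ord('\uac00'), ord('\ud7af')) == (44032, 55215)            # Hangul Syllables
assert (ord('\U00020000'), ord('\U0002ffff')) == (131072, 196607)  # Supplementary Ideographic Plane
print('all ranges consistent')
```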
Combining the is_cjk() in this answer with @EvenLisle's substring answer:
>>> from nltk.tokenize.util import is_cjk
>>> text = u'sdf344asfasf天地方益3権sdfsdf'
>>> [1 if is_cjk(ch) else 0 for ch in text]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
>>> def cjk_substrings(string):
... i = 0
... while i<len(string):
... if is_cjk(string[i]):
... start = i
... while is_cjk(string[i]): i += 1
... yield string[start:i]
... i += 1
...
>>> string = "sdf344asfasf天地方益3権sdfsdf".decode("utf-8")
>>> for sub in cjk_substrings(string):
... string = string.replace(sub, "(" + sub + ")")
...
>>> string
u'sdf344asfasf(\u5929\u5730\u65b9\u76ca)3(\u6a29)sdfsdf'
>>> print string
sdf344asfasf(天地方益)3(権)sdfsdf
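On Python 3, where str is already Unicode, the same works without the .decode("utf-8") call. A self-contained sketch (is_cjk re-implemented here from the ranges above rather than imported from nltk, and with a bounds check so a trailing CJK run doesn't raise IndexError):

```python
def is_cjk(character):
    # same code-point ranges as nltk's is_cjk above
    return any(start <= ord(character) <= end for start, end in
               [(4352, 4607), (11904, 42191), (43072, 43135),
                (44032, 55215), (63744, 64255), (65072, 65103),
                (65381, 65500), (131072, 196607)])

def cjk_substrings(string):
    # yield maximal runs of consecutive CJK characters
    i = 0
    while i < len(string):
        if is_cjk(string[i]):
            start = i
            while i < len(string) and is_cjk(string[i]):
                i += 1
            yield string[start:i]
        i += 1

string = 'sdf344asfasf天地方益3権sdfsdf'
for sub in cjk_substrings(string):
    string = string.replace(sub, '(' + sub + ')')
print(string)  # sdf344asfasf(天地方益)3(権)sdfsdf
```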
Upvotes: 10
Reputation: 414585
If you can't use the regex module that provides access to the IsKatakana, IsHan properties as shown in @一二三's answer, you could use the character ranges from @EvenLisle's answer with the stdlib's re module:
>>> import re
>>> print(re.sub(u"([\u3300-\u33ff\ufe30-\ufe4f\uf900-\ufaff\U0002f800-\U0002fa1f\u30a0-\u30ff\u2e80-\u2eff\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002b73f\U0002b740-\U0002b81f\U0002b820-\U0002ceaf]+)", r"(\1)", u'sdf344asfasf天地方益3権sdfsdf'))
sdf344asfasf(天地方益)3(権)sdfsdf
Beware of known issues.
You could also check Unicode category:
>>> import unicodedata
>>> unicodedata.category(u'天')
'Lo'
>>> unicodedata.category(u's')
'Ll'
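Building on that, a minimal sketch that wraps runs of category-'Lo' (Letter, other) characters. Note that 'Lo' is broader than CJK (it also covers e.g. Arabic and Hebrew letters), so this is only suitable when the input is known to mix ASCII with CJK:

```python
import unicodedata

def wrap_lo_runs(text):
    # Collect consecutive 'Lo' characters into runs and parenthesize each run.
    out, run = [], []
    for ch in text:
        if unicodedata.category(ch) == 'Lo':
            run.append(ch)
        else:
            if run:
                out.append('(' + ''.join(run) + ')')
                run = []
            out.append(ch)
    if run:
        out.append('(' + ''.join(run) + ')')
    return ''.join(out)

print(wrap_lo_runs('sdf344asfasf天地方益3権sdfsdf'))
# sdf344asfasf(天地方益)3(権)sdfsdf
```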
Upvotes: 5
Reputation: 21249
You can do the edit using the regex package, which supports checking the Unicode "Script" property of each character and is a drop-in replacement for the re package:
import regex as re

pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)
input = u'sdf344asfasf天地方益3権sdfsdf'
output = pattern.sub(r'(\1)', input)
print(output)  # Prints: sdf344asfasf(天地方益)3(権)sdfsdf
You should adjust the \p{Is...} sequences with the character scripts/blocks that you consider to be "Chinese or Japanese".
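For example, to also count Hangul as a match (an assumption about what you want; the regex package accepts the same Is-prefixed script names), you could extend the character class:

```python
import regex

# \p{IsHangul} added to the original class — drop or add \p{Is...} script
# names to match what you consider "Chinese or Japanese" (this set is an
# illustrative assumption, not the answer's original choice).
pattern = regex.compile(r'([\p{IsHan}\p{IsHira}\p{IsKatakana}\p{IsHangul}]+)')
print(pattern.sub(r'(\1)', 'sdf344asfasf天地方益3権한글sdfsdf'))
# sdf344asfasf(天地方益)3(権한글)sdfsdf
```

Note that adjacent Han and Hangul runs are wrapped together, since both scripts are in the same character class.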
Upvotes: 24