Reputation: 308
I'm building a web crawler with Python and Scrapy to collect text from websites.
I only want to collect Japanese Hiragana text. Is there a way to detect Japanese Hiragana text?
Upvotes: 3
Views: 2271
Reputation: 623
See also the handy regex \p{...} escapes, which can detect various scripts (this needs the third-party regex module, not the built-in re):
https://stackoverflow.com/a/30100900/3944480
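import regex as re  # third-party "regex" module; the built-in re has no \p{...}
pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)
text = 'sdf344asfasf天地方益3権sdfsdf'
# Wrap every run of Han, Bopomofo, Hiragana or Katakana characters in parentheses
output = pattern.sub(r'(\1)', text)
print(output)  # Prints: sdf344asfasf(天地方益)3(権)sdfsdf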
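If only Hiragana is wanted, as in the question, the same idea can be sketched with the standard-library re module. The built-in re does not support \p{...}, so this uses an explicit code-point range for the Hiragana block instead. The name extract_hiragana is just an illustrative helper, not part of any library.

```python
import re
from typing import List

# Runs of characters from the Hiragana block (U+3040 to U+309F).
HIRAGANA_RUN = re.compile(r'[\u3040-\u309F]+')

def extract_hiragana(text: str) -> List[str]:
    """Return every contiguous run of Hiragana found in text."""
    return HIRAGANA_RUN.findall(text)

print(extract_hiragana('sdf344ひらがなasf天地カタカナのsdf'))
# ['ひらがな', 'の']
```

Katakana and Kanji are skipped because they sit outside that range.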
Upvotes: 0
Reputation: 3298
Assuming you only need Hiragana, and your text is decoded to Unicode strings (e.g. from UTF-8):
Hiragana occupies the Unicode block U+3040 to U+309F, so you could test for it with:
def char_is_hiragana(c: str) -> bool:
    return '\u3040' <= c <= '\u309F'

def string_is_hiragana(s: str) -> bool:
    return all(char_is_hiragana(c) for c in s)

print('ぁ', string_is_hiragana('ぁ'))
print('ひらがな', string_is_hiragana('ひらがな'))
print('a', string_is_hiragana('a'))
print('english', string_is_hiragana('english'))
ぁ True
ひらがな True
a False
english False
But note that this excludes historic and non-standard hiragana (hentaigana), whitespace, punctuation, Katakana and Kanji:
# hiragana
print('ひらがな', string_is_hiragana('ひらがな'))
# katakana
print('カタカナ', string_is_hiragana('カタカナ'))
# kanji
print('漢字', string_is_hiragana('漢字'))
# punctuation
print('ひらがなもじ「ゆ」', string_is_hiragana('ひらがなもじ「ゆ」'))
print('いいひと。', string_is_hiragana('いいひと。'))
ひらがな True
カタカナ False
漢字 False
ひらがなもじ「ゆ」 False
いいひと。 False
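As a possible alternative to the fixed range check, here is a minimal sketch that classifies a character by its Unicode name using the standard unicodedata module. The helper name char_is_hiragana_by_name is my own. It should also accept Hiragana letters that live outside U+3040 to U+309F (their names still start with HIRAGANA, subject to the Unicode version your Python ships with), but it rejects the shared KATAKANA-HIRAGANA sound marks inside that block, so it is not a drop-in replacement.

```python
import unicodedata

def char_is_hiragana_by_name(c: str) -> bool:
    # unicodedata.name() returns e.g. 'HIRAGANA LETTER A'. The '' default
    # avoids a ValueError for unassigned code points such as U+3040.
    return unicodedata.name(c, '').startswith('HIRAGANA')

print(char_is_hiragana_by_name('あ'))       # True
print(char_is_hiragana_by_name('ア'))       # False (KATAKANA LETTER A)
print(char_is_hiragana_by_name('\u3040'))   # False (unassigned)
```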
You could allow whitespace:
import string

def string_is_hiragana_or_whitespace(s: str) -> bool:
    # string.whitespace only contains ASCII whitespace (space, tab, newline, ...)
    return all(c in string.whitespace or char_is_hiragana(c) for c in s)

print('ひらがな ひらがな', string_is_hiragana_or_whitespace('ひらがな ひらがな'))
ひらがな ひらがな True
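Japanese text often uses the full-width ideographic space (U+3000), which is not in string.whitespace. A small self-contained sketch that accepts it by using str.isspace() instead; the function name string_is_hiragana_or_space is hypothetical:

```python
def string_is_hiragana_or_space(s: str) -> bool:
    # str.isspace() is True for Unicode whitespace, including U+3000.
    return all(c.isspace() or '\u3040' <= c <= '\u309F' for c in s)

print(string_is_hiragana_or_space('ひらがな\u3000ひらがな'))  # True
print(string_is_hiragana_or_space('ひらがな カタカナ'))        # False
```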
But I would avoid going down this path of being too specific. There are a lot of difficult problems: encoding, half-width characters, emoji, CJK code blocks, loan words, etc.
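One less strict approach, in that spirit, is to measure what fraction of the non-whitespace characters is Hiragana and apply a threshold rather than demanding a pure-Hiragana string. This is only a sketch: hiragana_ratio, is_mostly_hiragana and the 0.5 threshold are illustrative assumptions, not tuned values.

```python
def hiragana_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the Hiragana block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if '\u3040' <= c <= '\u309F')
    return hits / len(chars)

def is_mostly_hiragana(text: str, threshold: float = 0.5) -> bool:
    # threshold is an arbitrary starting point; adjust it for your data.
    return hiragana_ratio(text) >= threshold

print(hiragana_ratio('これは漢字'))          # 0.6 (3 of 5 characters)
print(is_mostly_hiragana('これは漢字'))      # True
print(is_mostly_hiragana('english text'))   # False
```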
Upvotes: 3
Reputation: 274
One option is the langdetect library.
pip install langdetect
Then in your code:
from langdetect import detect
detect("ハローワールド")
This returns the language code of the text, in this case ja.
Japanese text tends to be a mix of hiragana, katakana and kanji though. Does it need to specifically identify hiragana?
Upvotes: 2