Reputation: 3397
I'm trying to work out a regular expression (or any other method, for that matter) that will do the following:
Search a string for Japanese characters (hiragana, katakana, and kanji).
Wrap each uninterrupted run of Japanese characters in a tag. For example, もち and 名前はBenさん would yield the following:
<span lang="ja">もち</span>
<span lang="ja">名前は</span>Ben<span lang="ja">さん</span>
Do this globally within the string.
Upvotes: 1
Views: 2209
Reputation: 434785
I think you should be able to use:
gsub(/([\p{Hiragana}\p{Katakana}\p{Han}]+)/) { %Q{<span lang="ja">#{$1}</span>} }
For example:
'さ名前はBenさんx⽫⽬ㇰ'.gsub(/([\p{Hiragana}\p{Katakana}\p{Han}]+)/) { %Q{<span lang="ja">#{$1}</span>} }
produces:
<span lang="ja">さ名前は</span>Ben<span lang="ja">さん</span>x<span lang="ja">⽫⽬ㇰ</span>
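If you need this in more than one place, the same gsub call can be wrapped in a small method. This is only a sketch: the constant name JAPANESE_RUN and the method name wrap_japanese are made up for illustration, and it assumes a UTF-8 source file and UTF-8 strings. It uses the block argument instead of $1, so the capture group isn't needed.

```ruby
# Hypothetical names; the character class is the one from the answer above.
JAPANESE_RUN = /[\p{Hiragana}\p{Katakana}\p{Han}]+/

# Wraps every uninterrupted run of hiragana, katakana, or Han characters
# in a <span lang="ja"> tag and leaves everything else untouched.
def wrap_japanese(text)
  text.gsub(JAPANESE_RUN) { |run| %Q{<span lang="ja">#{run}</span>} }
end

puts wrap_japanese('もち')
# <span lang="ja">もち</span>
puts wrap_japanese('名前はBenさん')
# <span lang="ja">名前は</span>Ben<span lang="ja">さん</span>
```

Since gsub already replaces every match, nothing extra is needed to make it "global".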
Han should cover all the kanji, but it may also match Chinese characters that aren't used in Japanese (sorry, it has been a couple of decades since I've had to deal with Japanese at this level and I still don't know Japanese).
There are other characters (such as ㋀) that might appear in Japanese text but aren't covered by Hiragana, Katakana, or Han/Kanji, so you might need to expand the character class with some hex ranges, depending on the exact nature of the text you're dealing with and what you want to do with outliers such as ㋀.
Upvotes: 9