Regex detecting all Japanese characters in a string and wrapping the substrings in tags

Question

I'm trying to figure out a regular expression (or any other method, for that matter) that will best do the following:

Search a string for Japanese characters (hiragana, katakana, and kanji).

Wrap an uninterrupted substring of Japanese characters with a tag. For example もち and 名前はBenさん would yield the following:

もち
名前はBenさん

Do this globally within the string.

mu is too short · Accepted Answer

I think you should be able to use:

gsub(/([\p{Hiragana}\p{Katakana}\p{Han}]+)/) { %Q{#{$1}} }

For example:

'さ名前はBenさんx⽫⽬ㇰ'.gsub(/([\p{Hiragana}\p{Katakana}\p{Han}]+)/) { %Q{#{$1}} }

produces:

さ名前はBenさんx⽫⽬ㇰ
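As a minimal sketch of the same `gsub` approach with the wrapping made explicit: the `<span class="ja">` tag and the `wrap_japanese` helper name below are placeholders I've assumed for illustration, not anything taken from the question, so substitute whatever tag you actually need.

```ruby
# Matches one or more consecutive hiragana, katakana, or Han (kanji) characters.
JAPANESE_RUN = /[\p{Hiragana}\p{Katakana}\p{Han}]+/

# Wraps every uninterrupted run of Japanese characters in a tag.
# The default tag is a hypothetical placeholder; pass your own.
def wrap_japanese(text, open_tag: '<span class="ja">', close_tag: '</span>')
  text.gsub(JAPANESE_RUN) { |run| "#{open_tag}#{run}#{close_tag}" }
end

puts wrap_japanese('名前はBenさん')
# prints: <span class="ja">名前は</span>Ben<span class="ja">さん</span>
```

Since `gsub` replaces every match, this covers the "globally within the string" requirement without any extra flags, and the `+` quantifier keeps each uninterrupted run inside a single tag pair.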

Han should cover all the Kanji but it might include Chinese characters that aren't used in Japanese as well (sorry, it has been a couple decades since I've had to deal with Japanese on this level and I still don't know Japanese).

There are other characters (such as ㋀) that might appear in Japanese text but aren't covered by Hiragana, Katakana, or Han/Kanji. Depending on the exact nature of the text you're dealing with, and what you want to do with outliers such as ㋀, you might need to expand the character class with some hex ranges.
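Expanding the character class might look something like the sketch below. The extra code points are my own picks for illustration: U+32C0 to U+32CB is assumed to be the run of month symbols that starts with ㋀, and U+30FC is the prolonged sound mark ー, which may be classified as a shared "Common" character rather than Katakana, so listing it explicitly does no harm. Check the Unicode charts for the ranges your text actually needs. The `wrap_extended` name and the default tag are hypothetical as well.

```ruby
# Base class from the answer, plus two hand-picked additions (assumptions):
#   \u30FC          the prolonged sound mark (ー)
#   \u32C0-\u32CB   month symbols such as ㋀ (range assumed; verify for your data)
JAPANESE_EXTENDED = /[\p{Hiragana}\p{Katakana}\p{Han}\u30FC\u32C0-\u32CB]+/

# Wraps each run matched by the extended class; the default tag is a placeholder.
def wrap_extended(text, open_tag: '<span class="ja">', close_tag: '</span>')
  text.gsub(JAPANESE_EXTENDED) { |run| "#{open_tag}#{run}#{close_tag}" }
end

puts wrap_extended('1㋀ coffee コーヒー', open_tag: '[', close_tag: ']')
# prints: 1[㋀] coffee [コーヒー]
```

Keeping the additions as explicit `\u` escapes makes it easy to see, and later trim, exactly which outliers you decided to treat as Japanese.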
