Lazar Ljubenović

Reputation: 19764

RegExp for furigana (Japanese)

I'm trying to create a regex that removes furigana (ruby annotations) from Japanese words:

<ruby><rb>二度</rb><rp>(</rp><rt>にど</rt><rp>)</rp>と</ruby> //old string
二度と // new string

I came up with newString = oldString.replace(/<rt>.*<\/rt>/,'').replace(/<rp>.*<\/rp>/,'').replace('<ruby><rb>','').replace('</rb></ruby>','') and it works... almost.

When there are multiple ruby tags, it doesn't work as desired:

<ruby><rb>息</rb><rp>(</rp><rt>いき</rt><rp>)</rp></ruby>を<ruby><rb>切</rb><rp>(</rp><rt>き</rt><rp>)</rp></ruby>らして
息らして //new string, using function above (wrong)
息を切らして //should be this

I'm very new to RegExp, so I'm not sure how to handle this one.

Upvotes: 0

Views: 932

Answers (1)

Casimir et Hippolyte

Reputation: 89584

Try using:

var newstring = oldstring.replace(/<rb>([^<]*)<\/rb>|<rp>[^<]*<\/rp>|<rt>[^<]*<\/rt>|<\/?ruby>/g, "$1");

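The idea is to capture the content of the rb tags and put it back through $1 in the replacement string. The rp and rt tags are removed together with their content, and the ruby tags themselves are removed too. For the alternatives that have no capture group, $1 is empty, so the match is simply deleted.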
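As a quick check, here is a minimal sketch that wraps the replacement in a function and runs it on both strings from the question. The function name stripFurigana and the variable names are only illustrative; the regex is unchanged.

```javascript
// Hypothetical helper name; the pattern is the one given above.
function stripFurigana(html) {
  return html.replace(
    /<rb>([^<]*)<\/rb>|<rp>[^<]*<\/rp>|<rt>[^<]*<\/rt>|<\/?ruby>/g,
    "$1" // empty for every alternative except the <rb> one
  );
}

// The two inputs from the question.
const single =
  "<ruby><rb>二度</rb><rp>(</rp><rt>にど</rt><rp>)</rp>と</ruby>";
const multiple =
  "<ruby><rb>息</rb><rp>(</rp><rt>いき</rt><rp>)</rp></ruby>を" +
  "<ruby><rb>切</rb><rp>(</rp><rt>き</rt><rp>)</rp></ruby>らして";

console.log(stripFurigana(single));   // 二度と
console.log(stripFurigana(multiple)); // 息を切らして
```

Because of the g flag, every ruby group in the string is handled in a single pass, so the number of annotated words does not matter.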

Content between tags is matched with [^<]* (any run of characters other than <), since these tags (rb, rp, rt) can't be nested.
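This is also why the original attempt loses text when there are several ruby groups. The sketch below (variable names are illustrative) compares the greedy .* from the question with the negated class on the rt tags alone:

```javascript
const input =
  "<ruby><rb>息</rb><rp>(</rp><rt>いき</rt><rp>)</rp></ruby>を" +
  "<ruby><rb>切</rb><rp>(</rp><rt>き</rt><rp>)</rp></ruby>らして";

// Greedy: .* runs from the first <rt> all the way to the LAST </rt>,
// so everything in between (including を and 切) is deleted.
const greedy = input.replace(/<rt>.*<\/rt>/, "");

// Negated class: [^<]* cannot cross a "<", so each match stays
// inside one <rt>...</rt> pair; the g flag repeats it for every pair.
const bounded = input.replace(/<rt>[^<]*<\/rt>/g, "");

console.log(greedy.includes("切"));  // false
console.log(bounded.includes("切")); // true
```

A lazy quantifier (.*?) would also stop at the first closing tag, but the negated class states the intent more directly and avoids backtracking.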

Upvotes: 1
