cesare

Reputation: 2118

Regular expression to split Latin words with accented characters

I'm working on an HTML tool for studying Latin. In one exercise the student has to click on individual words inside a div that contains a Latin passage:

<div class="clickable">
                   Cum a Romanis copiis vincĭtur măr, Gallia terra fera est. 
Regionis incŏlae terram non colunt, autem sagittis feras necant et postea eas vorant. 
Etiam a_femĭnis vita agrestis agĭtur, 
miseras vestes induunt et cum familiā in parvis casis vivunt. 
Vita secūra nimiaeque divitiae a Gallis contemnuntur. 
Gallorum civitates acrĭter pugnant et ab inimicis copiis timentur. 
Galli densis silvis defenduntur, tamen Roma feram Galliam capit. 
</div>    

In my JavaScript I wrap each word in a <span> using a regex, and then apply some actions:

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // Wrap every run of \w characters in a span
    var myText = oldHtml.replace(/\b(\w+?)\b/g, '<span class="word">$1</span>');

    return myText;
}).click(function(event) {
    // Ignore clicks that do not land on a wrapped word
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

The problem is that words containing ĭ, ŏ or ā are not wrapped correctly: they are split at these characters.

How can I match this class of words correctly?

JS Fiddle

Upvotes: 1

Views: 665

Answers (2)

Slavik

Reputation: 6837

You can split your text on delimiters instead. In the common case these are whitespace or punctuation marks:

(.+?)([\s,.!?;:)([\]]+)

https://regex101.com/r/xW4pF1/5

Edit

var words = $('div.clickable');
words.html(function(index, oldHtml) {
    // $1 is the word, $2 is the run of delimiters that follows it
    var myText = oldHtml.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');

    return myText;
}).click(function(event) {
    if (!$(event.target).hasClass("word")) return;
    alert($(event.target).text());
});

https://jsfiddle.net/s568c0pp/3/
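
To see what the delimiter-based pattern does without jQuery, here is a minimal sketch on plain strings. The function name wrapByDelimiters is hypothetical and only used for illustration; the regex is the one from the answer above.

```javascript
// Hypothetical helper (not part of the original answer): applies the
// delimiter-based regex to a plain string instead of a jQuery element.
function wrapByDelimiters(text) {
  // $1 = shortest run of characters before a delimiter, $2 = the delimiters
  return text.replace(/(.+?)([\s,.!?;:)([\]]+)/g, '<span class="word">$1</span>$2');
}

console.log(wrapByDelimiters('vincĭtur măr, Gallia.'));
// A final word with no delimiter after it is left unwrapped:
console.log(wrapByDelimiters('Roma capit'));
```

Because the pattern needs a delimiter after each word, it relies on the text ending in punctuation or whitespace, which happens to be the case for the passage in the question.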

Upvotes: 4

Sergey Moukavoztchik

Reputation: 61

The \w metacharacter matches a word character: a-z, A-Z, 0-9 and the _ (underscore) character. Accented letters such as ĭ are not included, which is why your words get split. So you need to change your regex to use a range of Unicode characters instead of \w.
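
A minimal sketch of both points, assuming the accented letters are stored as single precomposed code points. The range U+00C0 to U+024F (Latin-1 Supplement through Latin Extended-B) and the names LATIN_WORD and wrapLatinWords are my own choices for illustration, not something the question prescribes; the range also happens to include the × and ÷ signs.

```javascript
// \w is ASCII-only, so the question's pattern breaks at the accented letter.
const asciiOnly = 'vincĭtur'.match(/\b(\w+?)\b/g);
console.log(asciiOnly); // ["vinc", "tur"]

// Assumed range: ASCII letters plus U+00C0-U+024F, which covers
// precomposed letters such as ĭ, ŏ, ā and ū.
const LATIN_WORD = /[A-Za-z\u00C0-\u024F]+/g;

// Hypothetical helper: wraps each run of letters from that range in a span.
function wrapLatinWords(text) {
  return text.replace(LATIN_WORD, '<span class="word">$&</span>');
}

console.log(wrapLatinWords('vincĭtur măr'));
```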

You can also try \p{L} instead of \w to match any Unicode letter. In JavaScript, \p{L} only works in a regex with the u flag (ES2018 or later).
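
A rough sketch of the \p{L} approach, assuming an engine that supports Unicode property escapes (ES2018 or later). The function name wrapWords is hypothetical, and adding \p{M} is my own extension so that letters written as a base letter plus a combining mark also stay in one word.

```javascript
// Hypothetical helper: wraps every run of Unicode letters (\p{L}) and
// combining marks (\p{M}) in a span. The u flag is required for \p{...}.
function wrapWords(text) {
  return text.replace(/[\p{L}\p{M}]+/gu, '<span class="word">$&</span>');
}

console.log(wrapWords('vincĭtur măr, familiā'));
```

Unlike \w, this class does not match digits or the underscore, and if the div ever contains nested tags, the letters inside the tag names would be wrapped too.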

See also: http://www.regular-expressions.info/unicode.html

Upvotes: 1
