Reputation: 11
I am very new to regex, and I have managed to write a check for whether a specific word exists inside a string without just being part of another word.
Example: I am looking for the word "banana". banana == true, bananarama == false
This all works fine, but a problem occurs when I look for two-letter words that contain Swedish letters (Å, Ä, Ö).
Example: I am looking for the word "på" in a string looking like this: "på påsk" and it comes back as negative. However if I look for the word "påsk" then it comes back positive. This is the regex I am using:
const doesWordExist = (s, word) => new RegExp('\\b' + word + '\\b', 'i').test(s);
const stringOfWords = "Färg på plagg";
console.log(doesWordExist(stringOfWords, "på"))
//Expected result: true
//Actual result: false
However, if I change the word "på" to a three-letter word, it comes back true:
const doesWordExist = (s, word) => new RegExp('\\b' + word + '\\b', 'i').test(s);
const stringOfWords = "Färg pås plagg";
console.log(doesWordExist(stringOfWords, "pås"))
//Expected result: true
//Actual result: true
I have been looking around for answers and found a few questions with similar issues with Swedish letters, but none of them really look for only the word in its entirety. Could anyone explain what I am doing wrong?
Upvotes: 0
Views: 319
Reputation: 7880
The word boundary \b strictly depends on the characters matched by \w, which is a short-hand character class for [A-Za-z0-9_].
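As a quick check of the above (a minimal sketch using the strings from the question): since "å" is not matched by \w, there is no word boundary between "å" and a following space, whereas there is one between "s" and a space.

```javascript
// \w only covers ASCII letters, digits and underscore,
// so "å" counts as a non-word character.
console.log(/\w/.test("å")); // false
console.log(/\w/.test("s")); // true

// "på" is followed by a space: non-word next to non-word, so \b cannot match there.
console.log(/\bpå\b/i.test("Färg på plagg")); // false

// "pås" is followed by a space: word character next to non-word, so \b matches.
console.log(/\bpås\b/i.test("Färg pås plagg")); // true
```

So the failure is not really about the word length; it happens whenever the word starts or ends with a letter outside [A-Za-z0-9_].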
To get similar behaviour, you have to re-implement the word boundary yourself, for example with lookarounds like this:
const swedishCharClass = '[a-zäöå]';
const doesWordExist = (s, word) => new RegExp(
'(?<!' + swedishCharClass + ')' + word + '(?!' + swedishCharClass + ')', 'i'
).test(s);
console.log(doesWordExist("Färg på plagg", "på")); // true
console.log(doesWordExist("Färg pås plagg", "pås")); // true
console.log(doesWordExist("Färg pås plagg", "på")); // false
For more complex alphabets, I'd suggest taking a look at Concrete Javascript Regex for Accented Characters (Diacritics).
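If hard-coding the alphabet is not desirable, a variation on the same idea is to use Unicode property escapes together with the u flag. This is only a sketch under some assumptions: the names doesWordExistUnicode and escapeRegExp are hypothetical (they are not from the code above), and it assumes an engine that supports both lookbehind and \p{...}, such as recent browsers and Node.js.

```javascript
// Hypothetical helper: escape regex metacharacters so the search word is
// treated literally (e.g. a "." in the word should not match any character).
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Assumption: a "word character" is any Unicode letter (\p{L}),
// any Unicode number (\p{N}) or an underscore.
const wordChar = '[\\p{L}\\p{N}_]';

// Hypothetical Unicode-aware variant of doesWordExist.
const doesWordExistUnicode = (s, word) => new RegExp(
  '(?<!' + wordChar + ')' + escapeRegExp(word) + '(?!' + wordChar + ')', 'iu'
).test(s);

console.log(doesWordExistUnicode("Färg på plagg", "på"));   // true
console.log(doesWordExistUnicode("Färg pås plagg", "pås")); // true
console.log(doesWordExistUnicode("Färg pås plagg", "på"));  // false
```

The trade-off is that \p{L} treats letters from every script as part of a word, which may be broader than wanted if only Swedish text is expected.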
Upvotes: 1