RegEx pattern with unusual unicode character and word boundaries

Question

I'm stuck with a problem concerning RegEx patterns and I hope somebody would explain it to me:

The task is to match object names and remove them from a description that's stored in one of the object's fields. I tried the following expression:

    final String description = object.getDescription();
    final Matcher descriptionMatcher =
        Pattern.compile("\\b" + object.getName() + "\\b", Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE)
            .matcher(description);

Everything works fine until the code encounters a "registered trademark" symbol appended to the name, e.g. String name = "ObjectName®";

If I remove the last word boundary, it is matched again. What is the reason for this behaviour and how can I improve this code to possibly find every such special case?

Note: the trademark sign is not separated from the object name by a space.

Casimir et Hippolyte · Accepted Answer

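The trailing \b fails because a word boundary only exists between a word character and a non-word character. ® is not a word character, so when it is followed by a space or by the end of the string there is no boundary after it. In this case, change your pattern to:

"\\b\\Q" + object.getName() + "\\E(?<=\\b|®)"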

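A minimal sketch to check how the two patterns behave, relying on the fact that ® is not a word character. The class name, method names and sample text are made up for this demonstration, and ® is written as \u00AE only so the source compiles regardless of file encoding:

```java
import java.util.regex.Pattern;

public class BoundaryDemo {
    // The registered trademark sign, written as an escape to avoid encoding issues.
    static final String REG = "\u00AE";

    static final int FLAGS = Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE;

    // Approach from the question: a word boundary on both sides of the name.
    static boolean findsWithWordBoundaries(String name, String text) {
        return Pattern.compile("\\b" + Pattern.quote(name) + "\\b", FLAGS)
                .matcher(text)
                .find();
    }

    // Pattern from this answer: the match may end at a word boundary
    // or directly after a trademark sign.
    static boolean findsWithTrademarkLookbehind(String name, String text) {
        return Pattern.compile("\\b\\Q" + name + "\\E(?<=\\b|" + REG + ")", FLAGS)
                .matcher(text)
                .find();
    }

    public static void main(String[] args) {
        String text = "The ObjectName" + REG + " device is new.";
        String nameWithSign = "ObjectName" + REG;

        // true: there is a boundary between 'e' and the trademark sign
        System.out.println(findsWithWordBoundaries("ObjectName", text));
        // false: no boundary between the trademark sign and the space
        System.out.println(findsWithWordBoundaries(nameWithSign, text));
        // true: the lookbehind accepts the trademark sign as the last character
        System.out.println(findsWithTrademarkLookbehind(nameWithSign, text));
    }
}
```
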
If you need to deal with more complex cases, use alternations in lookarounds instead of word boundaries. For example:

"(?<=\\s|^)\\Q" + object.getName() + "\\E(?=\\s|$)"

or

"(?<=\\s|^)" + Pattern.quote(object.getName()) + "(?=\\s|$)"

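Applied to the original task of removing the name from the description, the Pattern.quote variant might be used roughly as follows. This is only a sketch: the class and method names and the sample strings are hypothetical, and the accessor calls from the question are replaced by plain string parameters so the example is self-contained. Keep in mind that whitespace-only lookarounds will not match a name that is directly followed by punctuation such as a comma or a full stop.

```java
import java.util.regex.Pattern;

public class NameRemover {
    // Removes every occurrence of name that is delimited by whitespace
    // or by the start/end of the description (case-insensitive).
    static String removeName(String description, String name) {
        Pattern pattern = Pattern.compile(
                "(?<=\\s|^)" + Pattern.quote(name) + "(?=\\s|$)",
                Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE);
        return pattern.matcher(description).replaceAll("");
    }

    public static void main(String[] args) {
        String name = "ObjectName\u00AE"; // \u00AE is the registered trademark sign

        // Name at the start of the text is removed; the following space remains.
        System.out.println("[" + removeName("ObjectName\u00AE is a widget", name) + "]");

        // Name embedded in a longer token is left untouched.
        System.out.println("[" + removeName("MyObjectName\u00AE rocks", name) + "]");
    }
}
```

Pattern.quote is the safer of the two quoting styles, since a name that itself contains \E would end a hand-written \Q...\E section early.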