Unknown Id

Reputation: 460

RegEx pattern with unusual Unicode character and word boundaries

I'm stuck on a problem with RegEx patterns and I hope somebody can explain it to me:

The task is to match object names and remove them from a description that's stored in one of the object's fields. I tried the following expression:

    final String description = object.getDescription();
    final Matcher descriptionMatcher =
        Pattern.compile("\\b" + object.getName() + "\\b", Pattern.UNICODE_CASE | Pattern.CASE_INSENSITIVE)
            .matcher(description);

Everything works fine until the code encounters a name with a "registered trademark" symbol appended to it: String name = "ObjectName®";

If I remove the last word boundary, the name is matched again. What is the reason for this behaviour, and how can I improve this code so that it handles every such special case?

Note: the trademark sign is not separated from the object name by a space.

Upvotes: 0

Views: 280

Answers (2)

Casimir et Hippolyte

Reputation: 89565

In this case, change your pattern to:

"\\b\\Q" + object.getName() + "\\E(?<=\\b|®)"

If you need to deal with more complex cases, use alternations in lookarounds instead of word boundaries. Example:

"(?<=\\s|^)\\Q" + object.getName() + "\\E(?=\\s|$)"

or

"(?<=\\s|^)" + Pattern.quote(object.getName()) + "(?=\\s|$)"

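To see the lookaround version doing the actual removal, here is a minimal sketch. The class name NameRemover, the helper removeName and the sample strings are made up for illustration; only the pattern itself comes from the lines above.

```java
import java.util.regex.Pattern;

public class NameRemover {
    // Hypothetical helper: removes every occurrence of name that is delimited
    // by whitespace or by the start/end of the input (case-insensitive).
    static String removeName(String description, String name) {
        Pattern p = Pattern.compile(
            "(?<=\\s|^)" + Pattern.quote(name) + "(?=\\s|$)",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return p.matcher(description).replaceAll("");
    }

    public static void main(String[] args) {
        String name = "ObjectName\u00AE"; // "ObjectName®", written as an escape to avoid source-encoding issues
        // The name is surrounded by spaces, so it is removed.
        System.out.println(removeName("The " + name + " is new", name));
        // Here the name is only part of a longer token, so nothing is removed.
        System.out.println(removeName("The My" + name + " is new", name));
    }
}
```

Note that this only treats whitespace and the input edges as delimiters, so a name directly followed by a comma or a full stop is not matched. Add those characters to the lookahead if your descriptions need it.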
Upvotes: 0

Mena

Reputation: 48404

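The ® character is not considered a word character, and \b only matches at a position between a word character and a non-word character. When ® is followed by a space, punctuation or the end of the input, both sides of that position are non-word characters, so the trailing \b finds no boundary and your Pattern will not match.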
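The behaviour can be reproduced in isolation with a small sketch like the one below. The class name WordBoundaryDemo, the helper found and the sample text are assumptions for illustration, not part of the original code.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Hypothetical helper: true if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String name = "ObjectName\u00AE"; // "ObjectName®"
        String text = "The " + name + " is new";

        // Trailing \b sits between '®' and ' ', both non-word characters: no boundary there.
        System.out.println(found("\\b" + name + "\\b", text));
        // Dropping the trailing \b lets the name match again.
        System.out.println(found("\\b" + name, text));
        // A boundary does exist between 'e' (word) and '®' (non-word).
        System.out.println(found("\\bObjectName\\b", text));
    }
}
```

The first call reports no match while the other two do, which is the pattern described in the question.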

A quick and dirty solution would be to alternate it with the word boundary, if you only have this case:

Pattern.compile("\\b" + object.getName() + "(?:\\b|(?<=®))")

Upvotes: 0
