Reputation:
I am writing regular expressions for Unicode text in Java. However, for the script I am using, Devanagari (U+0900 to U+097F), there is a problem with word boundaries: \b matches next to dependent vowel signs (such as U+093E to U+094C) because they are treated like space characters.
Example: Suppose I have the string "कमल कमाल कम्हल कम्हाल". Note that 'मा' in the second word is formed by combining म and ा (which is treated as a space character). The same happens in the last word. As a result, the regular expression \b\w\b matches the 'ल' in 'कमाल', which is not correct for the language.
I hope the example helps.
Can I write a regular expression that behaves like \b except that it doesn't match at certain characters? Any feedback would be appreciated.
Upvotes: 1
Views: 466
Reputation: 1324787
The equivalent of word boundaries (if the built-in boundaries are not what you expected) would be:
(?<![x-y])(?=[x-y])...(?<=[x-y])(?![x-y])
That is because a "word boundary" means "a location where there is a word character on one side and not on the other".
So with look-behind and look-ahead expressions, you can define your own class of characters [x-y] to check when you want to isolate a "word boundary".
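To make the definition concrete, here is a minimal sketch that lists every position where such a custom boundary occurs. It assumes that the whole Devanagari block (U+0900 to U+097F) counts as word characters; the class name BoundaryPositions and the method boundaries are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Assumption: every code point in the Devanagari block is a "word character".
    static final String WORD = "[\\u0900-\\u097F]";

    // A boundary is a position with a word character on exactly one side:
    // either (nothing behind, one ahead) or (one behind, nothing ahead).
    static final Pattern BOUNDARY = Pattern.compile(
            "(?<!" + WORD + ")(?=" + WORD + ")" +
            "|" +
            "(?<=" + WORD + ")(?!" + WORD + ")");

    // Returns the char indices at which the zero-width boundary matches.
    static List<Integer> boundaries(String text) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        // "कमल कमाल": boundaries fall only at the edges of the two words,
        // not next to the vowel sign inside the second word.
        System.out.println(boundaries("कमल कमाल")); // [0, 3, 4, 8]
    }
}
```

If your notion of a word character is narrower or wider than the whole block, only the WORD constant needs to change.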
Upvotes: 0
Reputation: 143204
You should be able to accomplish what you want with the following regex operators:
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
(?<=X) X, via zero-width positive lookbehind
(?<!X) X, via zero-width negative lookbehind
(The above is quoted from the Java 6 Pattern API docs.)
Use (?<![foo])(?=[foo]) in place of \b before a word, and (?<=[foo])(?![foo]) in place of \b after a word, where "[foo]" is your set of "word characters".
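Applied to the single-character case from the question, this could look like the sketch below. The helper wholeWord is hypothetical (it is not part of java.util.regex), and using the whole Devanagari block as "[foo]" is an assumption you may want to adjust.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CustomBoundary {
    // Hypothetical helper: wraps `inner` in the two lookaround boundaries,
    // built from `wordClass`, a bracketed class such as "[\\u0900-\\u097F]".
    static Pattern wholeWord(String inner, String wordClass) {
        String before = "(?<!" + wordClass + ")(?=" + wordClass + ")"; // replaces the leading \b
        String after = "(?<=" + wordClass + ")(?!" + wordClass + ")";  // replaces the trailing \b
        return Pattern.compile(before + "(?:" + inner + ")" + after);
    }

    // Counts the non-overlapping matches of p in text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String devanagari = "[\\u0900-\\u097F]"; // assumed set of word characters
        // Counterpart of \b\w\b: exactly one word character standing alone.
        Pattern single = wholeWord(devanagari, devanagari);
        System.out.println(count(single, "कमल कमाल कम्हल कम्हाल")); // 0
        System.out.println(count(single, "क कमाल"));               // 1
    }
}
```

With these boundaries the 'ल' in 'कमाल' is no longer reported, because the vowel sign before it belongs to the word-character class.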
Upvotes: 0