lribinik

Reputation: 169

How to capture Hebrew with regex in Java?

I'm trying to catch a section of Hebrew text (the origin is comments on a news site) using the following regex:

[\u0590-\u05FF \\p{Graph} \\s]+

It works for most comments but some comments are missed.

I've tried to debug this and it seems there's a Hebrew letter that doesn't match the pattern.

When I extract this letter and print its integer value, it seems to be correct, but the regex still doesn't catch it.

Ideas?

Upvotes: 6

Views: 2200

Answers (1)

kirilloid

Reputation: 14304

It would be more semantically correct to use \p{InHebrew} instead of \u0590-\u05FF.

You also need to match punctuation, digits (at least the commonly used ones), and the various kinds of whitespace. I don't know what \p{Graph} covers, or whether there are any Hebrew-specific punctuation symbols, but it looks like your pattern is missing some of these.
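As a rough sketch of that idea (not a verified fix for the failing comment), the pattern below combines the Hebrew block with Unicode punctuation, decimal digits and whitespace. The class name HebrewMatch, the helper firstMatch and the sample string are made up for illustration, and the choice of \p{P} and \p{Nd} is an assumption about which extra characters the comments contain.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HebrewMatch {
    // Hebrew block, any Unicode punctuation (\p{P}), decimal digits (\p{Nd})
    // and whitespace. UNICODE_CHARACTER_CLASS makes \s cover Unicode
    // whitespace rather than only the ASCII whitespace characters.
    static final Pattern HEBREW_TEXT = Pattern.compile(
            "[\\p{InHebrew}\\p{P}\\p{Nd}\\s]+",
            Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: returns the first matching run, or null if none.
    static String firstMatch(String input) {
        Matcher m = HEBREW_TEXT.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Sample text: two Hebrew words with a comma, digits and '!'.
        String comment = "\u05E9\u05DC\u05D5\u05DD, \u05E2\u05D5\u05DC\u05DD 123!";
        System.out.println(comment.equals(firstMatch(comment)));
    }
}
```

If a letter still slips through, it may be worth checking whether its code point lies outside U+0590-U+05FF altogether; Hebrew presentation forms, for example, are encoded in a separate block. On Java 7 and later, the script property \p{IsHebrew} could be tried in place of the block, since it is defined per script rather than per code point range.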

Upvotes: 1
