Pol

Reputation: 5134

Regular expression to split on Hebrew letters

I'm trying to work out a regular expression that splits a Hebrew word into an array of the letters, numbers, and symbols it contains.

I don't know Hebrew, but this is what I have now (applied to each word with java.util.regex.Matcher.find):

(?:(?:\p{L}+|\p{N}+)[^\p{L}\p{N}]*|[^\p{L}\p{N}]+)

As test text I'm using the Book of Genesis from the Holy Bible (Genesis.xml from http://www.tanach.us/Pages/Technical.html#Offline).

UPDATE

I've changed the regex to a much simpler one, which seems to work fine:

\p{L}[^\p{L}]*

However, if someone who knows Hebrew can tell me whether this is the correct approach, that would be helpful.

For example:

Input:

בְּ/רֵאשִׁ֖ית
בָּרָ֣א
אֱלֹהִ֑ים
אֵ֥ת
הַ/שָּׁמַ֖יִם
וְ/אֵ֥ת
הָ/אָֽרֶץ׃

Output:

"בְּ/"
"רֵ"
"א"
"שִׁ֖"
"י"
"ת"
"בָּ"
"רָ֣"
"א"
"אֱ"
"לֹ"
"הִ֑"
"י"
"ם"
"אֵ֥"
"ת"
"הַ/"
"שָּׁ"
"מַ֖"
"יִ"
"ם"
"וְ/"
"אֵ֥"
"ת"
"הָ/"
"אָֽ"
"רֶ"
"ץ׃"
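For reference, here is a minimal sketch of how the simpler pattern might be driven with Matcher.find. The class and method names (HebrewLetterSplitter, splitLetters) are made up for illustration, and the sample word is written with Unicode escapes only to avoid right-to-left display problems in an editor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class HebrewLetterSplitter {
    // One letter followed by any run of non-letters (points, accents, punctuation).
    private static final Pattern LETTER_CLUSTER = Pattern.compile("\\p{L}[^\\p{L}]*");

    // Returns each letter together with the non-letter characters that follow it.
    static List<String> splitLetters(String word) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = LETTER_CLUSTER.matcher(word);
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Bet + qamats + dagesh, resh + qamats + accent, alef (the second input word).
        String word = "\u05D1\u05B8\u05BC\u05E8\u05B8\u05A3\u05D0";
        for (String part : splitLetters(word)) {
            System.out.println(part);
        }
    }
}
```

Hebrew vowel points and cantillation marks are combining marks rather than letters, so [^\p{L}]* keeps them attached to the preceding consonant, along with punctuation such as / and ׃. One thing to watch: anything that comes before the first letter of a word (a leading digit or symbol, say) is skipped by find and never appears in the result.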

Upvotes: 0

Views: 497

Answers (1)

user3349314

Reputation: 21

Maybe you can use this pattern:

String[] allWords = doc.split("[^א-ת']+");

Try changing the order of the Hebrew letters in the range, so that Alef comes first and then Tav.
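A small sketch of that split, in case it helps. It assumes the range is meant to cover U+05D0 (Alef) through U+05EA (Tav) and writes it with escapes so that right-to-left rendering in the editor cannot make the order look swapped. The class and method names (HebrewRangeSplit, splitOnNonHebrew) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class HebrewRangeSplit {
    // Same character class as [^א-ת']+, with the range endpoints written as escapes.
    static String[] splitOnNonHebrew(String text) {
        return text.split("[^\\u05D0-\\u05EA']+");
    }

    public static void main(String[] args) {
        // Two unpointed words separated by a space.
        String plain = "\u05E9\u05DC\u05D5\u05DD \u05E2\u05D5\u05DC\u05DD";
        System.out.println(Arrays.toString(splitOnNonHebrew(plain)));

        // A pointed word: the vowel points and accent fall outside the range,
        // so they act as delimiters and are dropped from the result.
        String pointed = "\u05D1\u05B8\u05BC\u05E8\u05B8\u05A3\u05D0";
        System.out.println(Arrays.toString(splitOnNonHebrew(pointed)));
    }
}
```

Note that on unpointed text this splits into words rather than letters. On pointed text it breaks wherever a vowel point or accent occurs and discards those marks, so adjacent letters with no mark between them stay together.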

Upvotes: 2
