Reputation: 5134
I'm trying to figure out a regular expression that splits a Hebrew word into an array of the letters/numbers/symbols in that word.
I don't know Hebrew, but this is what I have now (used with java.util.regex.Matcher.find()
on each word):
(?:(?:\p{L}+|\p{N}+)[^\p{L}\p{N}]*|[^\p{L}\p{N}]+)
The text I'm using is the Book of Genesis from the Holy Bible (Genesis.xml
from http://www.tanach.us/Pages/Technical.html#Offline).
UPDATE
I've changed the regex to a much simpler one, which seems to be working fine:
\p{L}[^\p{L}]*
However, if someone who knows Hebrew can tell me whether this is a correct approach, that would be helpful.
For example:
Input:
בְּ/רֵאשִׁ֖ית
בָּרָ֣א
אֱלֹהִ֑ים
אֵ֥ת
הַ/שָּׁמַ֖יִם
וְ/אֵ֥ת
הָ/אָֽרֶץ׃
Output:
"בְּ/"
"רֵ"
"א"
"שִׁ֖"
"י"
"ת"
"בָּ"
"רָ֣"
"א"
"אֱ"
"לֹ"
"הִ֑"
"י"
"ם"
"אֵ֥"
"ת"
"הַ/"
"שָּׁ"
"מַ֖"
"יִ"
"ם"
"וְ/"
"אֵ֥"
"ת"
"הָ/"
"אָֽ"
"רֶ"
"ץ׃"
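For reference, here is a minimal sketch of how the simpler pattern could be applied with Matcher.find() to get output like the above. The class and method names (HebrewSplitter, splitClusters) are made up for illustration, and the sample word is written with Unicode escapes only so that the source file encoding does not matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class HebrewSplitter {
    // One letter, followed by everything up to the next letter
    // (vowel points, cantillation marks, "/" separators, punctuation).
    private static final Pattern CLUSTER = Pattern.compile("\\p{L}[^\\p{L}]*");

    static List<String> splitClusters(String word) {
        List<String> clusters = new ArrayList<>();
        Matcher m = CLUSTER.matcher(word);
        while (m.find()) {
            clusters.add(m.group());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Second word of the sample input: bet + dagesh + qamats,
        // resh + qamats + munah, alef.
        String word = "\u05D1\u05BC\u05B8\u05E8\u05B8\u05A3\u05D0";
        for (String cluster : splitClusters(word)) {
            System.out.println("\"" + cluster + "\"");
        }
    }
}
```

Note that with find(), anything that comes before the first letter of a word (for example leading punctuation) is not returned at all, which may or may not matter for this text.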
Upvotes: 0
Views: 497
Reputation: 21
Maybe you can use this pattern:
String[] allWords = doc.split("[^א-ת']+");
Try changing the order of the Hebrew letters in the range: first the Alef, then the Tav.
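One way to avoid the ordering problem might be to write the range with Unicode escapes instead of literal letters (Alef is U+05D0, Tav is U+05EA). The following is only a sketch under that assumption; the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of any library.
class HebrewWordSplit {
    // Anything that is not a letter from Alef (U+05D0) to Tav (U+05EA)
    // or an apostrophe is treated as a separator.
    private static final String NON_HEBREW = "[^\\u05D0-\\u05EA']+";

    static String[] splitWords(String doc) {
        return doc.split(NON_HEBREW);
    }

    public static void main(String[] args) {
        // Two unpointed words separated by a comma and a space.
        String doc = "\u05E9\u05DC\u05D5\u05DD, \u05E2\u05D5\u05DC\u05DD";
        System.out.println(Arrays.toString(splitWords(doc)));
    }
}
```

Keep in mind that vowel points and cantillation marks lie outside the Alef to Tav range, so on pointed text like the Genesis file this pattern will also split in the middle of words, at every mark.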
Upvotes: 2