Reputation: 58087
In Hebrew, there are certain vowels that NSPredicate fails to ignore even when using the 'd' (diacritic insensitive) modifier in the predicate. I was told that the solution is to use regular expressions to do the search.
How do I take a search string and "use regex" to search Hebrew text that contains vowels, ignoring those vowels?
Edit:
In other words, if I wanted to search the following text, disregarding dashes and asterisks, how would I do so using regex?
Example Text:
I w-en*t t-o the st*o*r*-e yes-ster*day.
Edit 2:
Essentially, I want to:
Edit 3:
Here's how I'm implementing my search:
//
// The user updated the search text
//
- (BOOL)searchDisplayController:(UISearchDisplayController *)controller
shouldReloadTableForSearchString:(NSString *)searchString{
    NSMutableArray *unfilteredResults = [[[[self.fetchedResultsController sections] objectAtIndex:0] objects] mutableCopy];
    if (self.filteredArray == nil) {
        self.filteredArray = [[[NSMutableArray alloc ] init] autorelease];
    }
    [filteredArray removeAllObjects];
    NSPredicate *predicate;
    if (controller.searchBar.selectedScopeButtonIndex == 0) {
        predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", searchString];
    }else if (controller.searchBar.selectedScopeButtonIndex == 1) {
        predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[cd] %@", searchString];
    }else if (controller.searchBar.selectedScopeButtonIndex == 2){
        predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[cd] %@", searchString];
    }else{
        predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[cd] %@) OR (dvarTorahTitle CONTAINS[cd] %@) OR (dvarTorahContent CONTAINS[cd] %@)", searchString,searchString,searchString];
    }
    for (Article *article in unfilteredResults) {
        if ([predicate evaluateWithObject:article]) {
            [self.filteredArray addObject:article];
        }
    }
    [unfilteredResults release];
    return YES;
}
Edit 4:
I am not required to use regex for this, was just advised to do so. If you have another way that works, go for it!
Edit 5:
I've modified my search to look like this:
NSInteger length = [searchString length];
NSString *vowelsAsRegex = @"[\\u5B0-\\u55C4]*";
NSMutableString *modifiedSearchString = [searchString mutableCopy];
for (int i = length; i > 0; i--) {
    [modifiedSearchString insertString:vowelsAsRegex atIndex:i];
}
if (controller.searchBar.selectedScopeButtonIndex == 0) {
    predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", modifiedSearchString];
}else if (controller.searchBar.selectedScopeButtonIndex == 1) {
    predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[cd] %@", modifiedSearchString];
}else if (controller.searchBar.selectedScopeButtonIndex == 2){
    predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[cd] %@", modifiedSearchString];
}else{
    predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[cd] %@) OR (dvarTorahTitle CONTAINS[cd] %@) OR (dvarTorahContent CONTAINS[cd] %@)", modifiedSearchString,modifiedSearchString,modifiedSearchString];
}
for (Article *article in unfilteredResults) {
    if ([predicate evaluateWithObject:article]) {
        [self.filteredArray addObject:article];
    }
}
I'm still missing something here, what do I need to do to make this work?
Edit 6:
Okay, almost there. I need to make two more changes to be finished with this.
I need to be able to add other ranges of characters to the regex, which might appear instead of, or in addition to, the characters in the other set. I've tried changing the first range to this:
[\u05b0-\u05c, \u0591-\u05AF]?
Something tells me that this is incorrect.
Also, I need the rest of the regex to be case insensitive. What modifier do I need to use with the .* regex to make it case insensitive?
Upvotes: 11
Views: 1868
Reputation: 58087
This answer picks up where the question left off. Please read that for context.
As it turns out, iOS can make regular expressions case insensitive using an Objective-C modifier to NSPredicate. All that's left is to combine the two ranges. I realized that they are actually two consecutive ranges. My final code looks like this:
NSInteger length = [searchString length];
NSString *vowelsAsRegex = @"[\u0591-\u05c4]?[\u0591-\u05c4]?"; //Cantillation: \u0591-\u05AF Vowels: \u05b0-\u05c
NSMutableString *modifiedSearchString = [searchString mutableCopy];
for (int i = length; i > 0; i--) {
    [modifiedSearchString insertString:vowelsAsRegex atIndex:i];
}
if (controller.searchBar.selectedScopeButtonIndex == 0) {
    predicate = [NSPredicate predicateWithFormat:@"articleTitle CONTAINS[cd] %@", modifiedSearchString];
}else if (controller.searchBar.selectedScopeButtonIndex == 1) {
    predicate = [NSPredicate predicateWithFormat:@"articleContent CONTAINS[c] %@", modifiedSearchString];
}else if (controller.searchBar.selectedScopeButtonIndex == 2){
    predicate = [NSPredicate predicateWithFormat:@"ANY tags.tagText CONTAINS[c] %@", modifiedSearchString];
}else{
    predicate = [NSPredicate predicateWithFormat:@"(ANY tags.tagText CONTAINS[c] %@) OR (dvarTorahTitle CONTAINS[c] %@) OR (dvarTorahContent CONTAINS[c] %@)", modifiedSearchString,modifiedSearchString,modifiedSearchString];
}
[modifiedSearchString release];
for (Article *article in unfilteredResults) {
    if ([predicate evaluateWithObject:article]) {
        [self.filteredArray addObject:article];
    }
}
Note that the range portion of the regular expression repeats itself. This is because there can be both a cantillation mark and a vowel on a single letter. Now, I can search uppercase and lowercase English, and Hebrew with or without vowels and cantillation marks.
Awesome!
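For anyone adapting this: as far as I know, NSPredicate only treats its right-hand side as a regular expression with the MATCHES operator (CONTAINS compares the string literally), and MATCHES has to match the whole value, which is why a leading and trailing .* are needed. Below is a minimal sketch under those assumptions, not the exact code from this answer. MarkTolerantPredicate is a hypothetical helper name; it escapes regex metacharacters in the user's input and uses * rather than two ? groups, which accepts any number of marks per letter.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original answer): builds a predicate that
// matches `searchString` anywhere in the value at `keyPath`, allowing Hebrew
// marks in the range used in this thread (U+0591-U+05C4) after each character.
static NSPredicate *MarkTolerantPredicate(NSString *keyPath, NSString *searchString) {
    NSString *marks = @"[\\u0591-\\u05C4]*";
    // (?s) lets "." match newlines too, since MATCHES must cover the whole value.
    NSMutableString *pattern = [NSMutableString stringWithString:@"(?s).*"];
    NSUInteger index = 0;
    while (index < searchString.length) {
        // Step by composed character sequence so surrogate pairs stay intact.
        NSRange piece = [searchString rangeOfComposedCharacterSequenceAtIndex:index];
        NSString *literal = [searchString substringWithRange:piece];
        // Escape so that user input such as "." or "*" is matched literally.
        [pattern appendString:[NSRegularExpression escapedPatternForString:literal]];
        [pattern appendString:marks];
        index = NSMaxRange(piece);
    }
    [pattern appendString:@".*"];
    // [c] makes the regular expression case insensitive.
    return [NSPredicate predicateWithFormat:@"%K MATCHES[c] %@", keyPath, pattern];
}
```

It would be used as, for example, MarkTolerantPredicate(@"articleTitle", searchString) in place of the CONTAINS predicates, and evaluated against each article in the same loop.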
Upvotes: 2
Reputation: 7403
The Hebrew vowels are well defined in Unicode: Table of Hebrew characters and Marks
When you receive the input string from the user, you can insert the regular expression [\u05B0-\u05C4]* in between each character, and before and after the string. (The [] means match any of the included characters, and the * means match zero or more occurrences of the expression.) Then you can search the text block, using this as a regular expression. This expression allows you to find the exact string from the user's input. The user can also specify required vowels, which this expression would find.
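The interleaving idea could be sketched with NSRegularExpression roughly as follows. This is only an illustration: RangeOfQueryIgnoringHebrewMarks is a hypothetical name, the character range is widened to U+0591-U+05C4 so cantillation marks are tolerated as well, and the optional group is added after each character only (marks follow their base letter, so a leading group would pull the previous letter's marks into the match). Returning the matched range keeps the original vocalized text available for display or highlighting.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the range in `text` of the first occurrence of
// `query`, allowing any run of Hebrew marks (U+0591-U+05C4) after each
// character of the query. Returns {NSNotFound, 0} when there is no match.
static NSRange RangeOfQueryIgnoringHebrewMarks(NSString *text, NSString *query) {
    NSRange notFound = NSMakeRange(NSNotFound, 0);
    if (query.length == 0) {
        return notFound;
    }
    NSString *marks = @"[\\u0591-\\u05C4]*";
    NSMutableString *pattern = [NSMutableString string];
    NSUInteger index = 0;
    while (index < query.length) {
        // A composed sequence keeps any vowel the user typed attached to its
        // letter, so vowels present in the query are still required.
        NSRange piece = [query rangeOfComposedCharacterSequenceAtIndex:index];
        NSString *literal = [query substringWithRange:piece];
        // Escape regex metacharacters in the user's input.
        [pattern appendString:[NSRegularExpression escapedPatternForString:literal]];
        [pattern appendString:marks];
        index = NSMaxRange(piece);
    }
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        return notFound;
    }
    return [regex rangeOfFirstMatchInString:text
                                    options:0
                                      range:NSMakeRange(0, text.length)];
}
```

A caller would compare the returned location against NSNotFound to decide whether the article matches.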
I think that instead of trying to "ignore" the vowels, it would be easier to remove the vowels from both the large block of text and the user's string. Then you could search just the letters, as usual. This method would work if you don't need to display the vocalized text that the user found.
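A rough sketch of the stripping approach, assuming the same U+0591-U+05C4 range discussed in this thread. StripHebrewMarks and ContainsIgnoringHebrewMarks are hypothetical names. Note that this range also covers a few punctuation characters (for example maqaf, U+05BE), so stripping it will join words that were connected by a maqaf.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes every character in U+0591-U+05C4
// (cantillation marks, vowel points, and some punctuation) from `input`.
static NSString *StripHebrewMarks(NSString *input) {
    return [input stringByReplacingOccurrencesOfString:@"[\\u0591-\\u05C4]"
                                            withString:@""
                                               options:NSRegularExpressionSearch
                                                 range:NSMakeRange(0, input.length)];
}

// Hypothetical helper: plain case-insensitive substring search on the
// stripped versions of both the text and the user's query.
static BOOL ContainsIgnoringHebrewMarks(NSString *text, NSString *query) {
    NSString *plainText = StripHebrewMarks(text);
    NSString *plainQuery = StripHebrewMarks(query);
    if (plainQuery.length == 0) {
        return NO;
    }
    NSRange found = [plainText rangeOfString:plainQuery
                                     options:NSCaseInsensitiveSearch];
    return found.location != NSNotFound;
}
```

If the stripped text is stored alongside the original (for instance in a second attribute), the search can run against the stripped copy while the vocalized version is still shown to the user.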
Upvotes: 2