Reputation: 60859
Let's assume I can have the following strings:
"hey @john..."
"@john, hello"
"@john(hello)"
I am tokenizing the string to get every word separated by a space:
[myString componentsSeparatedByString:@" "];
My array of tokens now contains:
@john...
@john,
@john(hello)
I am checking for punctuation marks as follows:
NSRange textRange = [words rangeOfString:@","];
if (textRange.location != NSNotFound) { /* do something */ }
In these cases, how can I make sure only @john is tokenized, while retaining the trailing characters:
...
,
(hello)
Note: I would like to be able to handle all cases of characters at the end of a string. The above are just 3 examples.
Upvotes: 0
Views: 637
Reputation: 96323
Are you sure CFStringTokenizer or its new Snow-Leopard-only Cocoa equivalent wouldn't be a better fit?
Splitting on just spaces is a very naïve way to tokenize, as you've found. CFStringTokenizer and enumerateSubstrings… are much smarter about real human-language lexical rules.
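As a rough sketch of the Cocoa route, assuming the method meant here is NSString's -enumerateSubstringsInRange:options:usingBlock: with NSStringEnumerationByWords: word enumeration treats "@" and punctuation as separators, so the block below looks one character back to decide whether a word is a mention. MentionsInString is a hypothetical helper name, not part of any framework.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: collects every word that is immediately preceded by '@'.
static NSArray *MentionsInString(NSString *text)
{
    NSMutableArray *mentions = [NSMutableArray array];
    [text enumerateSubstringsInRange:NSMakeRange(0, [text length])
                             options:NSStringEnumerationByWords
                          usingBlock:^(NSString *word, NSRange wordRange,
                                       NSRange enclosingRange, BOOL *stop) {
        // The '@' is not part of the word itself, so check the character before it.
        if (wordRange.location > 0 &&
            [text characterAtIndex:wordRange.location - 1] == '@') {
            [mentions addObject:[@"@" stringByAppendingString:word]];
        }
    }];
    return mentions;
}
```

The trailing characters are not lost: inside the block, everything between the end of wordRange and the start of the next word is still available from the original string.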
Upvotes: 0
Reputation: 46020
You could use NSScanner and NSCharacterSet to do this. NSScanner can scan a string up to the first occurrence of a character in a set. If you get the +alphanumericCharacterSet and then call -invertedSet on it, you'll get a set of all non-alphanumeric characters.
This is probably not super-efficient but it will work:
NSArray* strings = [NSArray arrayWithObjects:
                    @"hey @john...",
                    @"@john, hello",
                    @"@john(hello)",
                    nil];

//get the characters we want to skip, which is everything except letters and numbers
NSCharacterSet* illegalChars = [[NSCharacterSet alphanumericCharacterSet] invertedSet];

for(NSString* currentString in strings)
{
    //this stores the tokens for the current string
    NSMutableArray* tokens = [NSMutableArray array];

    //split the string into unparsed tokens
    NSArray* split = [currentString componentsSeparatedByString:@" "];

    for(NSString* currentToken in split)
    {
        //we only want tokens that start with an @ symbol
        if([currentToken hasPrefix:@"@"])
        {
            NSString* token = nil;

            //start a scanner from the first character after the @ symbol
            NSScanner* scanner = [NSScanner scannerWithString:[currentToken substringFromIndex:1]];

            //keep scanning until we hit an illegal character
            [scanner scanUpToCharactersFromSet:illegalChars intoString:&token];

            //get the rest of the string (+1 accounts for the @ symbol we skipped)
            NSString* suffix = [currentToken substringFromIndex:[scanner scanLocation] + 1];

            if(token)
            {
                //store the token in a dictionary
                NSDictionary* tokenDict = [NSDictionary dictionaryWithObjectsAndKeys:
                                           [@"@" stringByAppendingString:token], @"token", //prepend the @ symbol that we skipped
                                           suffix, @"suffix",
                                           nil];
                [tokens addObject:tokenDict];
            }
        }
    }

    //output
    for(NSDictionary* dict in tokens)
    {
        NSLog(@"Found token: %@ additional characters: %@",[dict objectForKey:@"token"],[dict objectForKey:@"suffix"]);
    }
}
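A possible variation on the code above, offered only as a sketch: instead of splitting on spaces first, let one NSScanner walk the whole string, which would also catch a mention that does not start a space-separated token, such as "(@john)". ScanMentions is a hypothetical helper name, and treating only alphanumerics as name characters is an assumption carried over from the code above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns every "@name" found anywhere in the string.
static NSArray *ScanMentions(NSString *text)
{
    NSMutableArray *found = [NSMutableArray array];
    NSCharacterSet *nameChars = [NSCharacterSet alphanumericCharacterSet];
    NSScanner *scanner = [NSScanner scannerWithString:text];

    // By default NSScanner skips whitespace, which would turn "@ john" into a mention.
    [scanner setCharactersToBeSkipped:nil];

    while (![scanner isAtEnd]) {
        // Move to the next '@' (no-op if we are already on one).
        [scanner scanUpToString:@"@" intoString:NULL];
        if (![scanner scanString:@"@" intoString:NULL]) {
            break; // no more '@' symbols in the string
        }
        NSString *name = nil;
        if ([scanner scanCharactersFromSet:nameChars intoString:&name]) {
            [found addObject:[@"@" stringByAppendingString:name]];
        }
    }
    return found;
}
```

After each successful scan, [scanner scanLocation] points at the first trailing character, so the suffix can be read from the original string at that index if it is needed.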
Upvotes: 0
Reputation: 61228
See NSString's -rangeOfString:options:range: and give it a range of { [myString length] - [searchString length], [searchString length] }, then see if the resulting range's location is equal to NSNotFound (if it is, the string does not end with searchString). See the NSStringCompareOptions options in the docs for case sensitivity, etc.
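A minimal sketch of that check, wrapped in a hypothetical StringEndsWith helper. The length guard is an addition that the description above does not mention: without it, a search string longer than the receiver would produce an out-of-bounds range and raise an exception.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if `string` ends with `suffix` under the given compare options.
static BOOL StringEndsWith(NSString *string, NSString *suffix, NSStringCompareOptions options)
{
    NSUInteger stringLength = [string length];
    NSUInteger suffixLength = [suffix length];

    // Guard against an empty suffix and against an unsigned underflow in the range below.
    if (suffixLength == 0 || suffixLength > stringLength) {
        return NO;
    }

    // Restrict the search to the last `suffixLength` characters.
    NSRange tail = NSMakeRange(stringLength - suffixLength, suffixLength);
    NSRange match = [string rangeOfString:suffix options:options range:tail];
    return match.location != NSNotFound;
}
```

For example, StringEndsWith(@"@john,", @",", 0) tests for a trailing comma, and passing NSCaseInsensitiveSearch makes the comparison ignore case. This only answers whether one known suffix is present, so it would have to be called once per punctuation mark of interest.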
Upvotes: 1