mm24

Reputation: 9596

Tokenizing string in ios with regex

In short:

Given the following string:

Input string -> "hello, world" , oh my, parapappa12

I want to extract these three "tokens":

Output tokens ->


Tokenizing string in ios

I got a file containing some data. It looks something like:

word , word, word 
word , word, word 
word , word, word 

where some words can contain a ",", but only when the word starts and ends with a certain character, e.g. starts with " and ends with ".

Example of words:

word : blebla bla bla
word : "bla bla bla, bla"

How do I define a regular expression that tokenizes the file on ",", ignoring whitespace between the words and handling this "special" case?

I remember using regex in Perl to achieve something similar, but that was a long time ago and I have forgotten the syntax. I am also not sure whether this is supported in Objective-C on iOS.

Upvotes: 1

Views: 430

Answers (2)

Brad Allred

Reputation: 7534

Without knowing why you need to parse strings like this I can't give you a great answer, but here are some ideas that might be better than RegEx if you find yourself needing to parse something more complicated, or if you would just like to learn more about state machines and grammars.

  1. You can easily write a basic state machine parser to do basic parsing using NSScanner (the code from that link isn't great so ignore it, but the concept is illustrated)
  2. You can use something like ParseKit for really heavy duty parsing (probably overkill here)
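
The NSScanner idea from item 1 could look roughly like the sketch below. It is only an illustration under a few assumptions: `TokenizeLine` is a hypothetical helper name, fields are separated by commas, a field that starts with `"` runs to the next `"`, the surrounding quotes are dropped from the token, and escaped quotes are not handled.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits one line on commas, but keeps commas that
// appear inside a double-quoted field. Quotes are stripped from the result.
static NSArray<NSString *> *TokenizeLine(NSString *line)
{
    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];

    NSScanner *scanner = [NSScanner scannerWithString:line];
    scanner.charactersToBeSkipped = nil; // whitespace is handled explicitly

    while (![scanner isAtEnd]) {
        // Ignore leading whitespace before the field.
        [scanner scanCharactersFromSet:whitespace intoString:NULL];

        NSString *body = nil;
        if ([scanner scanString:@"\"" intoString:NULL]) {
            // State: inside a quoted field, read up to the closing quote.
            [scanner scanUpToString:@"\"" intoString:&body];
            [scanner scanString:@"\"" intoString:NULL];
            // Discard anything between the closing quote and the next comma.
            [scanner scanUpToString:@"," intoString:NULL];
            [tokens addObject:(body ?: @"")];
        } else {
            // State: plain field, read up to the next comma and trim it.
            [scanner scanUpToString:@"," intoString:&body];
            [tokens addObject:[(body ?: @"") stringByTrimmingCharactersInSet:whitespace]];
        }

        // Consume the separator, if there is one.
        [scanner scanString:@"," intoString:NULL];
    }
    return tokens;
}
```

Calling `TokenizeLine(@"\"hello, world\" , oh my, parapappa12")` should give the three tokens `hello, world`, `oh my` and `parapappa12`.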

You seem content with RegEx, but maybe this will help future visitors.

Upvotes: 0

Alexander Farber

Reputation: 22988

First, a Perl one-liner:


# echo -n '"hello, world" , oh my, parapappa12' | perl -ne 'print "<$1>\n" while /("[^"]*"|[^, ]+)/g'
<"hello, world">
<oh>
<my>
<parapappa12>

And here is the Objective-C method:

NSString* const str = @"\"hello, world\" , oh my, parapappa12";
[self splitCommas:str];

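- (void)splitCommas:(NSString*)str
{
    // Match either a double-quoted string or a run of non-comma, non-space characters.
    NSString* const pattern = @"(\"[^\"]*\"|[^, ]+)";

    NSError *error = nil;
    NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:pattern
                                                                      options:0
                                                                        error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return;
    }

    NSRange searchRange = NSMakeRange(0, [str length]);
    NSArray *matches = [regex matchesInString:str
                                      options:0
                                        range:searchRange];

    // Log each token in the order it appears in the input.
    for (NSTextCheckingResult *match in matches) {
        NSRange matchRange = [match range];
        NSLog(@"%@", [str substringWithRange:matchRange]);
    }
}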

Explanation for the regex:

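  1. You either match a "quoted string": "[^"]*" (a quote, any run of non-quote characters, then a closing quote)
  2. Or you match a run of characters outside quotes: [^, ]+ (anything but comma or space, so a space inside a word such as "oh my" also ends the token)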

(the square brackets define the "character class" and the caret negates it).
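
Because `[^, ]+` stops at every space, `oh my` comes out as two tokens in the output above. If a token should keep its inner spaces, one possible variation is sketched below. `SplitKeepingInnerSpaces` is a hypothetical helper name and the pattern is an assumption, not part of the original solution: the second alternative requires the token to start and end with a character that is neither a comma nor whitespace, and allows anything but a comma in between.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: like the method above, but unquoted tokens may
// contain spaces; leading and trailing whitespace is still excluded.
static NSArray<NSString *> *SplitKeepingInnerSpaces(NSString *str)
{
    // Regex: "[^"]*"  |  [^,\s](?:[^,]*[^,\s])?
    NSString *pattern = @"\"[^\"]*\"|[^,\\s](?:[^,]*[^,\\s])?";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:str options:0 range:NSMakeRange(0, str.length)];
    for (NSTextCheckingResult *match in matches) {
        [tokens addObject:[str substringWithRange:match.range]];
    }
    return tokens;
}
```

For the sample input this should return `"hello, world"`, `oh my` and `parapappa12`, with the quotes kept on the first token as in the Perl output.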

Note: My solution doesn't handle escaped quotes like in "I say \"Hello\""
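
If escaped quotes do matter, the quoted alternative can be extended so that a backslash consumes the character after it. This is a sketch under the assumption that a backslash is the escape character; `SplitWithEscapedQuotes` is a hypothetical helper name, and the backslashes are left in the returned tokens.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: same idea as the method above, but inside quotes a
// backslash followed by any character is treated as part of the string.
static NSArray<NSString *> *SplitWithEscapedQuotes(NSString *str)
{
    // Regex: "(?:[^"\\]|\\.)*"  |  [^, ]+
    NSString *pattern = @"\"(?:[^\"\\\\]|\\\\.)*\"|[^, ]+";

    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *tokens = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:str options:0 range:NSMakeRange(0, str.length)];
    for (NSTextCheckingResult *match in matches) {
        [tokens addObject:[str substringWithRange:match.range]];
    }
    return tokens;
}
```

With the input `"I say \"Hello\"", next` this should yield the two tokens `"I say \"Hello\""` and `next`.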

Upvotes: 1
