Lucien

Reputation: 451

NSRegularExpression find pattern with optional part

Here is the situation:

I have a file storing some data in the following format:

item1:value1 item2:value2 item3:value3 // \n
item1:value1 item2:value2
item1:value1 item2:value2
// and so on...

// item3:value3 IS OPTIONAL

I then load the contents of the file into an NSString to process them.

I want to match value2, but the presence of item3:value3 is optional on each line.

So I tried to use the ? regular expression operator, but I'm not sure how to use it.

Typically, I tried to match the following pattern (which doesn't work, of course):

@"item1:.* item2:(.*) (item3:.*)?\n"

Put another way, I want to combine these two patterns into one:

@"item1:.* item2:(.*) item3:.*\n" // Case 1 : item3:.* present in the line
@"item1:.* item2:(.*)\n"          // Case 2 : item3 not present

Note that I have already written a helper function that returns all matches in an NSMutableArray.

I hope this is clear enough :/

Thanks for help and ideas.

Upvotes: 0

Views: 560

Answers (2)

Stone Mason

Reputation: 2024

OK, it looks like there are a few errors in that regular expression. I'll run through them in turn.

Firstly, you are trying to match the end of a line with "\n". This will work fine if your string ends in a new line, but will not match the last line otherwise. To fix this, use the "$" symbol, and make sure to pass NSRegularExpressionAnchorsMatchLines as the options: parameter when you instantiate the regular expression, like:

NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"item1:.* item2:(.*?)(?: item3:.*)?$"
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                  error:nil];

The $ symbol is called an anchor, and by default matches the end of the string. Its opposite is the ^ anchor, which matches the start of the string. If you pass the NSRegularExpressionAnchorsMatchLines option, however, these anchors change behaviour to match the start and end of any line of the string.
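
To see the effect of that option in isolation, here is a minimal sketch. The helper name CountMatches and the sample text are made up for illustration; only the NSRegularExpression calls are real API. It counts matches of the same anchored pattern with and without NSRegularExpressionAnchorsMatchLines.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts how many times `pattern` matches in `text`.
static NSUInteger CountMatches(NSString *pattern,
                               NSRegularExpressionOptions options,
                               NSString *text) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:options
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return 0;
    }
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, text.length)];
}

int main(void) {
    @autoreleasepool {
        // Three lines, no trailing newline (made-up sample data).
        NSString *text = @"a:1\na:2\na:3";

        // Default: ^ and $ anchor to the whole string, so no single line matches.
        NSUInteger wholeString = CountMatches(@"^a:\\d$", 0, text);

        // With the option: ^ and $ anchor to every line.
        NSUInteger perLine = CountMatches(@"^a:\\d$",
                                          NSRegularExpressionAnchorsMatchLines,
                                          text);

        NSLog(@"default: %lu, per line: %lu",
              (unsigned long)wholeString, (unsigned long)perLine);
        // logs "default: 0, per line: 3"
    }
    return 0;
}
```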

Secondly, you're using plain parentheses, (), to group the "item3:" part, but you don't want to get this group out as a result of the match (a "capture"). If you don't want to capture the text in a group, write the group as (?:...). Strictly speaking, plain parentheses will work (and do in your example), but they make the regular expression engine do more work, because it has to keep track of what's inside the capture group so that you can access it when the method returns (in your case with rangeAtIndex:2).
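
As a quick check of the difference, the sketch below compares the number of capture groups the engine tracks for the two spellings. CaptureGroupCount is a made-up helper name; numberOfCaptureGroups is the real NSRegularExpression property.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns how many capture groups a pattern defines.
static NSUInteger CaptureGroupCount(NSString *pattern) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];
    return regex.numberOfCaptureGroups;
}

int main(void) {
    @autoreleasepool {
        // Plain parentheses around the item3 part: two groups are tracked.
        NSUInteger capturing = CaptureGroupCount(@"item2:(.*?)( item3:.*)?$");

        // (?:...) around the item3 part: only the value2 group is tracked.
        NSUInteger nonCapturing = CaptureGroupCount(@"item2:(.*?)(?: item3:.*)?$");

        NSLog(@"capturing: %lu, non-capturing: %lu",
              (unsigned long)capturing, (unsigned long)nonCapturing);
        // logs "capturing: 2, non-capturing: 1"
    }
    return 0;
}
```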

Thirdly, you misplaced a space in your regular expression (just before the opening parenthesis of the item3 group), so your regular expression would only match a line if the data of item2 ended in a space or the line had an item3 entry. This is what made it seem as though the ? wasn't working, and fixing it would have solved your main problem on its own. The space needs to be inside the group that is followed by the question mark; otherwise your regular expression will only match if the space is actually there!

And finally: the * operator is greedy by default, meaning that it will match as much as it possibly can. As a result, the (.*) part of your regular expression eats up all of the text until the end of the line, and the regular expression still matches because the (item3:.*)? part is optional. Placing a ? after the * (i.e. .*?) makes it lazy, so that it matches as little text as possible. The regular expression will then, where possible, match the item3 part of a line with the (item3:.*)? group rather than with the item2:(.*) group.
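
The difference is easy to see on a single line. In this sketch, FirstCapture is a made-up helper and the sample line is invented; it returns the text of capture group 1 for the first match, once with the greedy quantifier and once with the lazy one.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of capture group 1 for the first
// match of `pattern` in `text`, or nil if there is no match.
static NSString *FirstCapture(NSString *pattern, NSString *text) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:text
                          options:0
                            range:NSMakeRange(0, text.length)];
    if (match == nil || [match rangeAtIndex:1].location == NSNotFound) {
        return nil;
    }
    return [text substringWithRange:[match rangeAtIndex:1]];
}

int main(void) {
    @autoreleasepool {
        NSString *line = @"item1:a item2:b item3:c";

        // Greedy: (.*) swallows the item3 part as well.
        NSString *greedy = FirstCapture(@"item2:(.*)(?: item3:.*)?$", line);

        // Lazy: (.*?) stops as soon as the optional group can take over.
        NSString *lazy = FirstCapture(@"item2:(.*?)(?: item3:.*)?$", line);

        NSLog(@"greedy: '%@', lazy: '%@'", greedy, lazy);
        // logs "greedy: 'b item3:c', lazy: 'b'"
    }
    return 0;
}
```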

So your regular expression would look like:

@"item1:.* item2:(.*?)(?: item3:.*)?$"
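
Putting it together, a minimal sketch of collecting every value2 into an array might look like the following. The function name Value2Matches and the sample data are made up, and this is not your existing helper, just one way to drive the pattern above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the value2 of every line in `text`.
static NSArray<NSString *> *Value2Matches(NSString *text) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"item1:.* item2:(.*?)(?: item3:.*)?$"
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return @[];
    }

    NSMutableArray<NSString *> *values = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:text
                       options:0
                         range:NSMakeRange(0, text.length)];
    for (NSTextCheckingResult *match in matches) {
        // Group 1 is the (.*?) after "item2:".
        [values addObject:[text substringWithRange:[match rangeAtIndex:1]]];
    }
    return values;
}

int main(void) {
    @autoreleasepool {
        // Made-up sample: item3 is present on the first line only,
        // and the last line has no trailing newline.
        NSString *text = @"item1:a item2:b item3:c\n"
                         @"item1:d item2:e\n"
                         @"item1:f item2:g";
        NSLog(@"%@", Value2Matches(text));
        // logs an array containing b, e, g
    }
    return 0;
}
```

Note that this still assumes value2 itself never contains the text " item3:", since the lazy quantifier stops at the first place the optional group can match.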

Upvotes: 1

uchuugaka

Reputation: 12782

If you have reliably consistent patterns in your text, you can analyze them to build your regular expression and Objective-C logic.

First, identify substrings that reliably separate the elements you are interested in. Judging from what you pasted, you might first split the text on the newline separator to get an array of lines. This is useful if the numbered items on each line are related to one another.

Next, it looks from what you've pasted that you might have multiple ways to identify the portions of each line that you are interested in.

Again, you really need to simply have some idea of what might be in your strings and what won't be in them.

You could use the white space to further separate the items, if and only if the items themselves never contain white space. If all you can rely on is a definition of an item, then you have a bit more work. Definition: an item's value is the string immediately following a string with this pattern: beginning of line or a single space, followed by "item", followed by a number 1, 2, or 3, followed by a ":".

The end of the value string is delimited by an end of line or the delimiter beginning another item.

From this you should be able to replace the definition of the pattern with a regular expression.

You will have an easier time if you break this into multiple steps using programming language logic and conditionals and don't try to do everything in a single regular expression.
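
As a rough sketch of that multi-step approach, with no regular expression at all: split into lines, split each line on spaces, then look for the item you want. The helper name ValueForKey and the sample data are made up, and the code assumes that items are separated by single spaces and that values never contain spaces.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the value for `key` (e.g. @"item2") on one
// line, or nil if the line has no such item.
// Assumption: items are space-separated and values contain no spaces.
static NSString *ValueForKey(NSString *key, NSString *line) {
    NSString *prefix = [key stringByAppendingString:@":"];
    for (NSString *item in [line componentsSeparatedByString:@" "]) {
        if ([item hasPrefix:prefix]) {
            return [item substringFromIndex:prefix.length];
        }
    }
    return nil;
}

int main(void) {
    @autoreleasepool {
        NSString *contents = @"item1:a item2:b item3:c\nitem1:d item2:e\n";
        NSMutableArray<NSString *> *values = [NSMutableArray array];

        // Step 1: split the file contents into lines.
        NSArray<NSString *> *lines =
            [contents componentsSeparatedByCharactersInSet:
                          [NSCharacterSet newlineCharacterSet]];

        // Step 2: pick item2 out of each line; empty lines yield nil and are skipped.
        for (NSString *line in lines) {
            NSString *value = ValueForKey(@"item2", line);
            if (value != nil) {
                [values addObject:value];
            }
        }

        NSLog(@"%@", values);
        // logs an array containing b, e
    }
    return 0;
}
```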

Upvotes: 0
