XML parsing with optional element in regex

Question

I have some XML that looks like this:


  
    123
    John Smith    
    1, The street
    Upper Town Street
    Anytown
    County
    POS TCD

Address2 is optional, however, so this is also valid:


  
    123
    John Smith    
    1, The street
    Anytown
    County
    POS TCD

(Note: this is a cut-down XML snippet.)

I have the following regex that matches correctly when Address2 is specified:

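<Reference>(?<Reference>.*)</Reference>[\w|\W]*<Name>(?<Name>.*)</Name>[\w|\W]*<Address1>(?<Address1>.*)</Address1>[\w|\W]*<Address2>(?<Address2>.*)</Address2>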

It doesn't work when Address2 isn't specified. The closest I've got is the following:

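<Reference>(?<Reference>.*)</Reference>[\w|\W]*<Name>(?<Name>.*)</Name>[\w|\W]*<Address1>(?<Address1>.*)</Address1>[\w|\W]*(<Address2>(?<Address2>.*)</Address2>)?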

This matches and populates Reference, Name and Address1 for both XML snippets, but it leaves Address2 blank in both cases, rather than giving Address2 the value Upper Town Street for the first snippet.

Aside: I know that using an XML parser would probably be easier, but the XML isn't clean and this was supposed to be a quick and easy solution(!). I also know that I could break this down into a set of regexes, but it has now become a bit of an intellectual challenge, and I'd love to have a solution to it.

Alan Moore · Accepted Answer

Quick and dirty answer:

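<Reference>(?<Reference>.*)</Reference>[\w\W]*?<Name>(?<Name>.*)</Name>[\w\W]*?<Address1>(?<Address1>.*)</Address1>(?:[\w\W]*?<Address2>(?<Address2>.*)</Address2>)?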

First, I removed the |; it wasn't harming anything, but it was unnecessary. [\w\W] already means a word character, or a character that's not a word character. Like most other metacharacters, | loses its special meaning inside a character class, and just matches itself.
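A small check of that claim, sketched in Python (the sample characters are arbitrary and only there for illustration): both classes accept exactly the same characters, and a class reduced to `[\w|]` shows the pipe matching itself literally.

```python
import re

with_pipe = re.compile(r"[\w|\W]")
without_pipe = re.compile(r"[\w\W]")

# Arbitrary sample characters: word, whitespace, punctuation, and the pipe itself.
samples = ["a", "7", "_", " ", "\n", "<", "|"]
agree = all(
    bool(with_pipe.fullmatch(ch)) == bool(without_pipe.fullmatch(ch))
    for ch in samples
)

# With \W removed, the class is just "word character or literal pipe".
pipe_is_literal = re.fullmatch(r"[\w|]", "|") is not None
space_rejected = re.fullmatch(r"[\w|]", " ") is None

print(agree, pipe_is_literal, space_rejected)
```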

But the main point was changing the * to *?, making it non-greedy. Each [\w\W]* initially gobbles up the whole rest of the text, then backtracks so it can match the next required part (e.g., <Name>(?<Name>.*)</Name>). But the Address2 part is not required, so the regex engine doesn't bother backtracking to take it in.

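Making the quantifier non-greedy reverses the priorities: before it gobbles up the next character, it first tries to match the next part of the regex. For the last gap, that only helps if the gap sits inside the optional group, as in (?:[\w\W]*?<Address2>(?<Address2>.*)</Address2>)?. If the lazy gap sits in front of the group, the group is free to match nothing right after Address1, and the match ends there. With the gap inside, the Address2 line gets matched if present, even though it's optional.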
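The effect of the last gap can be tried out with a rough Python sketch. Several things here are assumptions: Python spells named groups `(?P<name>...)` rather than `(?<name>...)`, the `<Customer>` wrapper and `<Town>` element are made up for the sample, and the helper `address2` is purely illustrative. The gaps between the required elements are lazy in all three variants; only the gap before Address2 differs.

```python
import re

# Hypothetical sample: only Reference, Name, Address1 and Address2 are named in
# the question; the <Customer> wrapper and <Town> are assumed for illustration.
WITH_ADDRESS2 = (
    "<Customer>\n"
    "    <Reference>123</Reference>\n"
    "    <Name>John Smith</Name>\n"
    "    <Address1>1, The street</Address1>\n"
    "    <Address2>Upper Town Street</Address2>\n"
    "    <Town>Anytown</Town>\n"
    "</Customer>\n"
)
WITHOUT_ADDRESS2 = WITH_ADDRESS2.replace(
    "    <Address2>Upper Town Street</Address2>\n", ""
)

HEAD = (
    r"<Reference>(?P<Reference>.*)</Reference>[\w\W]*?"
    r"<Name>(?P<Name>.*)</Name>[\w\W]*?"
    r"<Address1>(?P<Address1>.*)</Address1>"
)
ADDRESS2 = r"<Address2>(?P<Address2>.*)</Address2>"

# Greedy gap in front of the optional group: the gap swallows Address2.
greedy = re.compile(HEAD + r"[\w\W]*" + "(" + ADDRESS2 + ")?")
# Lazy gap in front of the optional group: the group may match nothing at once.
lazy_before = re.compile(HEAD + r"[\w\W]*?" + "(" + ADDRESS2 + ")?")
# Lazy gap inside the optional group: the gap grows until Address2 is found.
lazy_inside = re.compile(HEAD + r"(?:[\w\W]*?" + ADDRESS2 + ")?")


def address2(pattern, text):
    """Return the Address2 capture, or None if it did not participate."""
    match = pattern.search(text)
    return match.group("Address2") if match else None


for name, pattern in [("greedy", greedy), ("lazy_before", lazy_before),
                      ("lazy_inside", lazy_inside)]:
    print(name, address2(pattern, WITH_ADDRESS2), address2(pattern, WITHOUT_ADDRESS2))
```

One design caveat, as far as this sketch goes: if several records share one string, a lazy gap inside the optional group can run on into a later record to find an Address2 there, which is another reason to prefer a tighter gap when the layout allows it.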

But if your XML is really formatted the way you showed it, all there is between the elements is whitespace. I would just use \s*, and not have to worry about it matching too much or too little.
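A minimal sketch of the \s* variant in Python, assuming the elements really are separated by nothing but whitespace. Python writes named groups as `(?P<name>...)`; the `<Customer>` wrapper, the `<Town>` element and the second record are hypothetical sample data, not taken from the question.

```python
import re

PATTERN = re.compile(
    r"<Reference>(?P<Reference>.*)</Reference>\s*"
    r"<Name>(?P<Name>.*)</Name>\s*"
    r"<Address1>(?P<Address1>.*)</Address1>\s*"
    r"(?:<Address2>(?P<Address2>.*)</Address2>)?"
)

# Hypothetical data: the first record has Address2, the second does not.
RECORDS = """<Customer>
    <Reference>123</Reference>
    <Name>John Smith</Name>
    <Address1>1, The street</Address1>
    <Address2>Upper Town Street</Address2>
    <Town>Anytown</Town>
</Customer>
<Customer>
    <Reference>456</Reference>
    <Name>Jane Doe</Name>
    <Address1>2, The avenue</Address1>
    <Town>Othertown</Town>
</Customer>
"""

# One dict per record; Address2 is None where the element is absent.
rows = [match.groupdict() for match in PATTERN.finditer(RECORDS)]
for row in rows:
    print(row)
```

Because \s* can only consume whitespace, it cannot skip past another element, so the optional group is tried exactly where Address2 would have to start and a missing Address2 never pulls in text from a later record.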
