Jane

Reputation: 2043

XML parsing with optional element in regex

I have some XML that looks like this:

<records>
  <Customer>
    <Reference>123</Reference>
    <Name>John Smith</Name>    
    <Address1>1, The street</Address1>
    <Address2>Upper Town Street</Address2>
    <Address3>Anytown</Address3>
    <Address4>County</Address4>
    <PostCode>POS TCD</PostCode>
  </Customer>
</records>

but Address2 is optional, so this is also valid:

<records>
  <Customer>
    <Reference>123</Reference>
    <Name>John Smith</Name>    
    <Address1>1, The street</Address1>
    <Address3>Anytown</Address3>
    <Address4>County</Address4>
    <PostCode>POS TCD</PostCode>
  </Customer>
</records>

(Note: this is a cut-down XML snippet.)

I have the following regex that matches correctly when Address2 is specified:

<Reference>(?<Reference>.*)</Reference>[\w|\W]*<Name>(?<Name>.*)</Name>[\w|\W]*<Address1>(?<Address1>.*)</Address1>[\w|\W]*<Address2>(?<Address2>.*)</Address2>

It doesn't work when Address2 isn't specified. The closest I've got is the following:

<Reference>(?<Reference>.*)</Reference>[\w|\W]*<Name>(?<Name>.*)</Name>[\w|\W]*<Address1>(?<Address1>.*)</Address1>[\w|\W]*(<Address2>(?<Address2>.*)</Address2>)?

which matches and populates Reference, Name and Address1 for both XML snippets, but leaves Address2 blank in both cases instead of capturing Upper Town Street for the first snippet.

Aside: I know that using an XML parser would probably be easier, but the XML isn't clean and this was supposed to be a quick and easy solution(!). I also know that I could break this down into a set of regexes, but this has now become a bit of an intellectual challenge, and I'd love to have a solution to it.

Upvotes: 0

Views: 389

Answers (2)

Alan Moore

Reputation: 75252

Quick and dirty answer:

<Reference>(?<Reference>.*)</Reference>[\w\W]*?<Name>(?<Name>.*)</Name>[\w\W]*?<Address1>(?<Address1>.*)</Address1>[\w\W]*?(<Address2>(?<Address2>.*)</Address2>)?

First, I removed the |; it wasn't harming anything, but it was unnecessary. [\w\W] already means a word character, or a character that's not a word character. Like most other metacharacters, | loses its special meaning inside a character class, and just matches itself.
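A quick way to check the point about | is to try it. This is only a sketch in Python (the question doesn't say which regex flavour is in use, so Python is an assumption here), but character classes treat | the same way in the common flavours:

```python
import re

# Inside a character class, "|" is a literal pipe, not alternation.
pipe_matches = re.findall(r"[a|b]", "a|b")

# [\w|\W] and [\w\W] accept exactly the same characters, newline included.
sample = "x \n|<"
with_pipe = re.findall(r"[\w|\W]", sample)
without_pipe = re.findall(r"[\w\W]", sample)

print(pipe_matches)               # ['a', '|', 'b']
print(with_pipe == without_pipe)  # True
```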

But the main point was changing the * to *?, making it non-greedy. Each [\w\W]* initially gobbles up the whole rest of the text, then backtracks so it can match the next required part (e.g., <Name>(?<Name>.*)</Name>). But the Address2 part is not required, so the regex engine doesn't bother backtracking to take it in.

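Making the quantifier non-greedy reverses the priorities: before it consumes the next character, it first tries to match the next part of the regex. That lets the Address2 element be matched if present, even though it's optional. Note that this only helps when something required follows the optional group, such as \s*<Address3>; if the optional group is the last thing in the regex, the non-greedy quantifier can stop at once and Address2 is still skipped.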
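Here is a small sketch of the greedy versus non-greedy difference. It is in Python, so named groups are written (?P<name>...) rather than (?<name>...). The XML is a trimmed copy of the question's sample, build is a hypothetical helper, and the trailing \s*<Address3> is an assumption added so that something required follows the optional group; without it, a non-greedy quantifier can succeed by matching nothing at all.

```python
import re

# Trimmed copy of the sample from the question.
XML = """<Customer>
    <Address1>1, The street</Address1>
    <Address2>Upper Town Street</Address2>
    <Address3>Anytown</Address3>
</Customer>"""


def build(gap):
    """Hypothetical helper: same pattern, only the filler before Address2 varies."""
    return re.compile(
        r"<Address1>(?P<Address1>.*)</Address1>" + gap +
        r"(?:<Address2>(?P<Address2>.*)</Address2>)?"
        # Assumption: a required element follows the optional group.
        r"\s*<Address3>(?P<Address3>.*)</Address3>"
    )


greedy = build(r"[\w\W]*").search(XML)
lazy = build(r"[\w\W]*?").search(XML)

# Greedy filler swallows the Address2 element; the optional group stays empty.
print(greedy.group("Address2"))  # None
# Non-greedy filler stops early enough for the optional group to match.
print(lazy.group("Address2"))    # Upper Town Street
```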

But if your XML is really formatted the way you showed it, all there is between the elements is whitespace. I would just use \s*, and not have to worry about it matching too much or too little.
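As a rough illustration of the \s* suggestion, here is the same pattern with \s* between the elements, run against both samples from the question. It is a Python sketch, so the named groups use (?P<name>...); the variable names are made up for the example, and it assumes nothing but whitespace sits between the elements.

```python
import re

# \s* can only cross whitespace, so it cannot skip over an Address2 element.
PATTERN = re.compile(
    r"<Reference>(?P<Reference>.*)</Reference>\s*"
    r"<Name>(?P<Name>.*)</Name>\s*"
    r"<Address1>(?P<Address1>.*)</Address1>\s*"
    r"(?:<Address2>(?P<Address2>.*)</Address2>)?"
)

WITH_ADDRESS2 = """<records>
  <Customer>
    <Reference>123</Reference>
    <Name>John Smith</Name>
    <Address1>1, The street</Address1>
    <Address2>Upper Town Street</Address2>
    <Address3>Anytown</Address3>
  </Customer>
</records>"""

WITHOUT_ADDRESS2 = """<records>
  <Customer>
    <Reference>123</Reference>
    <Name>John Smith</Name>
    <Address1>1, The street</Address1>
    <Address3>Anytown</Address3>
  </Customer>
</records>"""

first = PATTERN.search(WITH_ADDRESS2).groupdict()
second = PATTERN.search(WITHOUT_ADDRESS2).groupdict()

print(first["Address2"])   # Upper Town Street
print(second["Address2"])  # None
```

Both samples still populate Reference, Name and Address1; only Address2 differs.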

Upvotes: 1

Gilles Quénot

Reputation: 185730

Instead of using regex, fix your broken XML and use your mind on a more interesting problem =)

Regexes are not the right tool to parse an XML file. Parsing XML in 2013 is a solved problem; don't try to reinvent the wheel.

As you already said, use an XML parser. Add your language to your original post if you want me to suggest some.
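For what it's worth, here is roughly what the parser route could look like. The question doesn't name a language, so this sketch assumes Python and its standard xml.etree.ElementTree module, and it assumes the XML has already been repaired enough to be well-formed. FIELDS and customers are names invented for the example.

```python
import xml.etree.ElementTree as ET

# Second sample from the question: Address2 is absent.
XML = """<records>
  <Customer>
    <Reference>123</Reference>
    <Name>John Smith</Name>
    <Address1>1, The street</Address1>
    <Address3>Anytown</Address3>
    <Address4>County</Address4>
    <PostCode>POS TCD</PostCode>
  </Customer>
</records>"""

FIELDS = ("Reference", "Name", "Address1", "Address2")

root = ET.fromstring(XML)

# findtext returns None for a missing child, so optional elements need no special case.
customers = [
    {field: customer.findtext(field) for field in FIELDS}
    for customer in root.iter("Customer")
]

print(customers)
```

A missing Address2 simply comes back as None, with no backtracking to reason about.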

The best I know to parse & is .


See RegEx match open tags except XHTML self-contained tags

Upvotes: 2
