Reputation: 2043
I have some xml that looks like:
<records>
<Customer>
<Reference>123</Reference>
<Name>John Smith</Name>
<Address1>1, The street</Address1>
<Address2>Upper Town Street</Address2>
<Address3>Anytown</Address3>
<Address4>County</Address4>
<PostCode>POS TCD</PostCode>
</Customer>
</records>
but for which Address2 is optional, so this is also valid:
<records>
<Customer>
<Reference>123</Reference>
<Name>John Smith</Name>
<Address1>1, The street</Address1>
<Address3>Anytown</Address3>
<Address4>County</Address4>
<PostCode>POS TCD</PostCode>
</Customer>
</records>
(Note: this is a cut down xml snippet)
I have the following regex that matches correctly when Address2 is specified:
<Reference>(?<Reference>.*)</Reference>[\w|\W]*<Name>(?<Name>.*)</Name>[\w|\W]*<Address1>(?<Address1>.*)</Address1>[\w|\W]*<Address2>(?<Address2>.*)</Address2>
It doesn't work for the case when Address2 isn't specified. The closest I've got is the following :
<Reference>(?<Reference>.*)</Reference>[\w|\W]*<Name>(?<Name>.*)</Name>[\w|\W]*<Address1>(?<Address1>.*)</Address1>[\w|\W]*(<Address2>(?<Address2>.*)</Address2>)?
which matches and populates Reference, Name and Address1 for both XML snippets, but leaves Address2 blank in both cases, rather than giving it the value Upper Town Street for the first snippet.
Aside: I know that using an XML parser would probably be easier, but the XML isn't clean and this was supposed to be a quick and easy solution(!). I also know that I could break this down into a set of regexes, but this has now become a bit of an intellectual challenge, and I'd love to have a solution to it.
Upvotes: 0
Views: 389
Reputation: 75252
Quick and dirty answer:
<Reference>(?<Reference>.*)</Reference>[\w\W]*?<Name>(?<Name>.*)</Name>[\w\W]*?<Address1>(?<Address1>.*)</Address1>([\w\W]*?<Address2>(?<Address2>.*)</Address2>)?
First, I removed the |; it wasn't harming anything, but it was unnecessary. [\w\W] already means a word character, or a character that's not a word character. Like most other metacharacters, | loses its special meaning inside a character class, and just matches itself.
But the main point was changing the * to *?, making it non-greedy, and moving the last [\w\W]*? inside the optional group. Each [\w\W]* initially gobbles up the whole rest of the text, then backtracks so it can match the next required part (e.g., <Name>(?<Name>.*)</Name>). But the Address2 part is not required, so the regex engine doesn't bother backtracking to take it in.
Making the quantifier non-greedy reverses the priorities: before it gobbles up the next character, it first tries to match the next part of the regex. That alone is not enough for the last gap, though. An optional group is satisfied by matching nothing, so a lazy gap placed in front of it stops at zero characters, where the next character is a line break rather than <Address2>. With the gap inside the group, <Address2> is required within that group, so the engine scans forward for it and only skips the group if it isn't there. That ensures that the Address2 line gets matched if present, even though it's optional.
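To see how the placement of the final gap affects the optional group, the variants can be run side by side. This is only a sketch in Python, which was not named in the question: Python's re module writes named groups as (?P<name>...) rather than (?<name>...), and the helper name address2 and the PATTERNS table are made up for the demonstration.

```python
import re

WITH_ADDRESS2 = """<records>
<Customer>
<Reference>123</Reference>
<Name>John Smith</Name>
<Address1>1, The street</Address1>
<Address2>Upper Town Street</Address2>
<Address3>Anytown</Address3>
</Customer>
</records>"""

# The same record with the optional line removed.
WITHOUT_ADDRESS2 = WITH_ADDRESS2.replace(
    "<Address2>Upper Town Street</Address2>\n", ""
)

# Everything up to and including </Address1>, shared by all variants.
HEAD = (
    r"<Reference>(?P<Reference>.*)</Reference>[\w\W]*?"
    r"<Name>(?P<Name>.*)</Name>[\w\W]*?"
    r"<Address1>(?P<Address1>.*)</Address1>"
)
ADDRESS2 = r"<Address2>(?P<Address2>.*)</Address2>"

PATTERNS = {
    "greedy gap, then optional group": HEAD + r"[\w\W]*(?:" + ADDRESS2 + r")?",
    "lazy gap, then optional group": HEAD + r"[\w\W]*?(?:" + ADDRESS2 + r")?",
    "lazy gap inside optional group": HEAD + r"(?:[\w\W]*?" + ADDRESS2 + r")?",
}


def address2(pattern, text):
    """Return the Address2 capture, or None if the group did not take part."""
    match = re.search(pattern, text)
    return match.group("Address2") if match else None


for label, pattern in PATTERNS.items():
    print(label, "->",
          address2(pattern, WITH_ADDRESS2), "/",
          address2(pattern, WITHOUT_ADDRESS2))
```

On this sample, only the variant with the gap inside the optional group captures Upper Town Street; the other two match the record but leave Address2 empty, because the optional group is allowed to match nothing.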
But if your XML is really formatted the way you showed it, all there is between the elements is whitespace. I would just use \s*, and not have to worry about it matching too much or too little.
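A minimal sketch of the \s* variant, again in Python syntax ((?P<name>...) for named groups) as an assumption about the target language. The second customer in the sample is invented purely to show that a record without Address2 does not reach into a neighbouring record; CUSTOMER and parse_customers are hypothetical names.

```python
import re

# Only whitespace is allowed between elements, so the optional
# Address2 group is tried exactly where that element would start.
CUSTOMER = re.compile(
    r"<Reference>(?P<Reference>.*)</Reference>\s*"
    r"<Name>(?P<Name>.*)</Name>\s*"
    r"<Address1>(?P<Address1>.*)</Address1>\s*"
    r"(?:<Address2>(?P<Address2>.*)</Address2>)?"
)

SAMPLE = """<records>
<Customer>
<Reference>123</Reference>
<Name>John Smith</Name>
<Address1>1, The street</Address1>
<Address2>Upper Town Street</Address2>
<Address3>Anytown</Address3>
</Customer>
<Customer>
<Reference>124</Reference>
<Name>Jane Doe</Name>
<Address1>2, The street</Address1>
<Address3>Anytown</Address3>
</Customer>
</records>"""


def parse_customers(text):
    """Return one dict per matched customer; Address2 is None when absent."""
    return [match.groupdict() for match in CUSTOMER.finditer(text)]


customers = parse_customers(SAMPLE)
for customer in customers:
    print(customer)
```

Because \s* cannot skip over other elements, each match stays within its own record, which is the "too much or too little" worry the unrestricted [\w\W] gap brings with it.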
Upvotes: 1
Reputation: 185730
Instead of using regex, fix your broken XML and save your brainpower for a more interesting problem =)
Regexes are not the right tool for parsing an XML file. Parsing XML in 2013 is a solved problem; don't try to reinvent the wheel.
As you already said, use an XML parser. Add your language to your original post if you want me to suggest some.
The best tool I know for querying XML and HTML is XPath.
See RegEx match open tags except XHTML self-contained tags
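Since the question doesn't name a language, here is one possible sketch in Python using the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. It assumes the input has already been made well-formed (the question says the real data isn't), and read_customers is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<records>
<Customer>
<Reference>123</Reference>
<Name>John Smith</Name>
<Address1>1, The street</Address1>
<Address3>Anytown</Address3>
<Address4>County</Address4>
<PostCode>POS TCD</PostCode>
</Customer>
</records>"""

FIELDS = ("Reference", "Name", "Address1", "Address2")


def read_customers(xml_text):
    """Return one dict per <Customer>; missing elements come back as None."""
    root = ET.fromstring(xml_text)
    return [
        {field: customer.findtext(field) for field in FIELDS}
        for customer in root.findall("./Customer")
    ]


customers = read_customers(SAMPLE)
print(customers)
```

findtext returns None when the child element is absent, so the optional Address2 needs no special handling.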
Upvotes: 2