Reputation: 83
I have a string that I need to parse (sadly, its format is out of my control). It looks like this:
000010<001>0005</001><002>03</002><003>20140813</003><004>194642</004><006>0000000000</006><007></007><008></008><009>20140901</009><010>ENSK</010><011></011><013>195409108932</013><015></015><016>NORM</016><017>250602</017><018>N</018><019>N</019><020>8</020><021>93892</021><022>TESTVALUE</022><023>00</023><024></024><026>0000000000</026><028>HXF164</028><029>FIAT 60-90 DT</029><030>0000</030><031>MRÖD</031><032>6090DT1L224324</032><033>FI</033><034>007066</034><035>06</035><036>007066</036><037>ITRAFIK</037><038>19970915</038><039>KONVERT</039><040>19841123</040><041>00000000</041><042>19841023</042><043>REGBES</043><044>20050920</044><045></045><046>J</046><047>00000000</047><048></048><049></049><050>00000000</050><051>00</051><052>00000000</052><053>00000000</053><054>000000</054><055>01</055><056>000</056><061>09</061><062></062><064>DIN</064><065>00000</065><066>02</066><067>MANUELL/TESTTEST</067>
The actual string is much longer, but this will do for the question at hand (why this format was chosen is beyond me, but that is another topic...). I need to get each "xml-ish" element into a separate string so that I can handle the values separately.
I've come up with this regex pattern:
const string pattern = @"<\d+>[^<]+?</\d+>";
which matches any element that has a value. I can safely ignore the ones with no value, giving me a list of matches like this:
<001>0005</001> <002>03</002> and skipping those with no value: <007></007>
It seems to do the trick and will probably work in most cases. However, if any of the values contains a '<', it will not work as intended.
Example:
000010<001>0005</001><002>03</002><003>2014<0813</003><004>194642</004><006>0000000000</006><007></007><008></008><009>2014<0901</009><010>ENSK</010><011></011><013>195409108932</013><015></015><016>NORM</016>
where the 009 element is no longer picked up (and neither is the 003 element, which also contains a '<').
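To reproduce the problem in isolation, here is a minimal sketch. It uses Python purely for illustration (the pattern itself is unchanged from the C# constant above), and `sample` is a shortened, hypothetical excerpt of the example string:

```python
import re

# The pattern from the question: a value may not contain '<' at all.
PATTERN = re.compile(r"<\d+>[^<]+?</\d+>")

# Shortened excerpt of the example string (assumption: enough to show the issue).
sample = (
    "000010<001>0005</001><002>03</002><003>2014<0813</003>"
    "<004>194642</004><007></007><009>2014<0901</009><010>ENSK</010>"
)

# findall returns the full match text because the pattern has no groups.
matches = PATTERN.findall(sample)
print(matches)
# Elements 003 and 009 (values containing '<') and the empty 007 are missing.
```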
Can I tweak the regex so that I'm safe from this? For some reason I've not been able to make it work the way I want.
This is a great site for testing regexes if anyone wants to play around with it:
Regards
Upvotes: 0
Views: 58
Reputation: 91528
I'd use:
<(\d+)>.+?</\1>
It matches an opening tag and a closing tag with the same number, so anything in between (including '<') is treated as the value.
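A rough sketch of how this could be used, written in Python for illustration (the backreference `\1` should behave the same way in .NET). The second capture group around the value is my own addition so the value can be read directly, and `sample` is a shortened, hypothetical excerpt of the question's string:

```python
import re

# Group 1 is the tag number, group 2 the value (group 2 is an addition for
# illustration); \1 requires the closing tag to repeat the same number.
PATTERN = re.compile(r"<(\d+)>(.+?)</\1>")

sample = (
    "000010<001>0005</001><002>03</002><003>2014<0813</003>"
    "<004>194642</004><007></007><009>2014<0901</009><010>ENSK</010>"
)

# Map each tag number to its value; empty elements such as 007 are skipped
# because .+? needs at least one character.
elements = {m.group(1): m.group(2) for m in PATTERN.finditer(sample)}
print(elements["003"])  # 2014<0813
print(elements["009"])  # 2014<0901
```

One caveat under these assumptions: if an empty element's number were repeated later in the string, `.+?` would run on until that later closing tag, so this relies on tag numbers being unique.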
Upvotes: 0
Reputation: 1360
Depending on the regex engine you are using, you can use negative lookahead:
a(?!b)
which means: match "a" that is not followed by "b". So the resulting expression will look like this:
<\d+>([^<]|<(?!\/\d))+?</\d+>
more: http://www.regular-expressions.info/lookaround.html
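As a quick check of the expression above, here is a small sketch in Python (chosen only for illustration; .NET supports the same lookahead syntax). I made the group non-capturing and dropped the unneeded `\/` escape, which does not change what is matched; `sample` is a shortened, hypothetical excerpt of the question's string:

```python
import re

# A value character is either any non-'<' character, or a '<' that does not
# start something that looks like a closing tag ('</' followed by a digit).
PATTERN = re.compile(r"<\d+>(?:[^<]|<(?!/\d))+?</\d+>")

sample = (
    "000010<001>0005</001><002>03</002><003>2014<0813</003>"
    "<004>194642</004><007></007><009>2014<0901</009><010>ENSK</010>"
)

matches = PATTERN.findall(sample)
for element in matches:
    print(element)
# 003 and 009 are now included; the empty 007 element is still skipped.
```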
Upvotes: 2
Reputation: 10814
This will accept '<' inside a value, but not '</', which might be stricter and hence closer to what you want:
<\d+>(<[^/]|[^<])+?</\d+>
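A small sketch to try this out, in Python for illustration only, with the group made non-capturing so that `findall` returns whole elements. The input strings are made up for the test. For comparison it also runs a negative-lookahead pattern on an edge case where a value ends in '<':

```python
import re

# Alternation from this answer: '<' is allowed only when the next character
# is not '/'. Note that '<[^/]' consumes two characters at a time.
ALTERNATION = re.compile(r"<\d+>(?:<[^/]|[^<])+?</\d+>")
# Negative-lookahead variant, included only for comparison.
LOOKAHEAD = re.compile(r"<\d+>(?:[^<]|<(?!/\d))+?</\d+>")

# Typical case: '<' in the middle of a value.
sample = "<003>2014<0813</003><007></007><009>2014<0901</009>"
typical = ALTERNATION.findall(sample)
print(typical)  # ['<003>2014<0813</003>', '<009>2014<0901</009>']

# Edge case (hypothetical input): the value ends with '<'.
edge = "<001>a<</001><002>b</002>"
alt_edge = ALTERNATION.findall(edge)
look_edge = LOOKAHEAD.findall(edge)
print(alt_edge)   # one over-long match spanning both elements
print(look_edge)  # two separate elements
```

So if a value can end in '<', the two-character `<[^/]` branch swallows the '<' of the closing tag and the match runs on into the next element; whether that matters depends on the data.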
Upvotes: 1