matsemann

Reputation: 83

How can I parse an XML-like string where values may have an unencoded < character?

I have a string (sadly, its format is out of my control) that I need to parse. It looks like this:

000010<001>0005</001><002>03</002><003>20140813</003><004>194642</004><006>0000000000</006><007></007><008></008><009>20140901</009><010>ENSK</010><011></011><013>195409108932</013><015></015><016>NORM</016><017>250602</017><018>N</018><019>N</019><020>8</020><021>93892</021><022>TESTVALUE</022><023>00</023><024></024><026>0000000000</026><028>HXF164</028><029>FIAT 60-90 DT</029><030>0000</030><031>MRÖD</031><032>6090DT1L224324</032><033>FI</033><034>007066</034><035>06</035><036>007066</036><037>ITRAFIK</037><038>19970915</038><039>KONVERT</039><040>19841123</040><041>00000000</041><042>19841023</042><043>REGBES</043><044>20050920</044><045></045><046>J</046><047>00000000</047><048></048><049></049><050>00000000</050><051>00</051><052>00000000</052><053>00000000</053><054>000000</054><055>01</055><056>000</056><061>09</061><062></062><064>DIN</064><065>00000</065><066>02</066><067>MANUELL/TESTTEST</067>

The actual string is much longer, but this will do for the question at hand (why it has this format is beyond me, but that is another topic). I need to get each "xml-ish" element into a separate string so that I can handle the values separately.

I've come up with this regex pattern:

const string pattern = @"<\d+>[^<]+?</\d+>"; 

which matches any element that has a value. I can safely ignore the ones with no value, giving me a list of matches like this:

<001>0005</001> <002>03</002> and skipping those with no value: <007></007>

It seems to do the trick and will probably work in most cases. However, if any of the values includes '<', it will not work as intended.

Example:

000010<001>0005</001><002>03</002><003>2014<0813</003><004>194642</004><006>0000000000</006><007></007><008></008><009>2014<0901</009><010>ENSK</010><011></011><013>195409108932</013><015></015><016>NORM</016>

where the 003 and 009 elements are no longer picked up.
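
The failure can be reproduced with a short script. This is only a sketch: Python's `re` module is used for illustration (the pattern itself is unchanged), and `sample` is a shortened, hypothetical version of the input above.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"<\d+>[^<]+?</\d+>")

# Shortened sample input (an assumption for this sketch) with a raw '<'
# inside the 003 and 009 values and one empty element (007).
sample = (
    "000010<001>0005</001><002>03</002><003>2014<0813</003>"
    "<004>194642</004><007></007><009>2014<0901</009><010>ENSK</010>"
)

# No capture groups, so findall returns the full text of each match.
matches = ORIGINAL.findall(sample)
for m in matches:
    print(m)
```

Because `[^<]` can never step over a `<`, every element whose value contains one is dropped, along with the empty element.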

Can I tweak the regex so that I'm safe from this? So far I haven't been able to make it work the way I want.

This is a great site for testing regexes, if anyone wants to play around with it:

http://www.regexr.com/

Regards

Upvotes: 0

Views: 58

Answers (3)

Toto

Reputation: 91528

I'd use:

<(\d+)>.+?</\1>

It matches an opening tag and a closing tag that carry the same number.
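
A minimal sketch of how this could be used, written in Python for illustration (the backreference `\1` works the same way in most engines). The second capture group around the value and the shortened `sample` string are assumptions added for the example, not part of the pattern above.

```python
import re

# \1 must repeat the digits captured by the opening tag.
# The second group (around the value) is added for this sketch only.
PAIRED = re.compile(r"<(\d+)>(.+?)</\1>")

# Shortened sample input (an assumption) with a raw '<' inside two values.
sample = (
    "000010<001>0005</001><002>03</002><003>2014<0813</003>"
    "<004>194642</004><007></007><009>2014<0901</009><010>ENSK</010>"
)

# findall returns (tag, value) tuples because the pattern has two groups.
fields = {tag: value for tag, value in PAIRED.findall(sample)}
for tag, value in fields.items():
    print(tag, value)
```

Since `.+?` needs at least one character, the empty element is skipped in this sample. Note that `.` does not match a newline by default in Python, so values spanning lines would need the `re.DOTALL` flag.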

Upvotes: 0

cdm

Reputation: 1360

Depending on the regex engine you are using, you can use a negative lookahead:

a(?!b)

which means: match "a" only when it is not followed by "b". Applied here, a "<" is allowed inside a value as long as it does not start a closing tag, so the resulting expression looks like this:

<\d+>([^<]|<(?!\/\d))+?</\d+>

More: http://www.regular-expressions.info/lookaround.html
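
A small sketch of the lookahead pattern in action, using Python's `re` module for illustration. The shortened `sample` string is an assumption made for the example.

```python
import re

# The pattern from this answer: a '<' inside a value is accepted
# unless it is followed by '/' and a digit (the start of a closing tag).
LOOKAHEAD = re.compile(r"<\d+>([^<]|<(?!\/\d))+?</\d+>")

# Shortened sample input (an assumption) with a raw '<' inside two values.
sample = (
    "000010<001>0005</001><002>03</002><003>2014<0813</003>"
    "<004>194642</004><007></007><009>2014<0901</009><010>ENSK</010>"
)

# The repeated group only keeps its last iteration, so read the
# whole match with group(0) instead of relying on findall.
matches = [m.group(0) for m in LOOKAHEAD.finditer(sample)]
for m in matches:
    print(m)
```

The two alternatives inside the group can never match the same character, so the engine has little room for backtracking here.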

Upvotes: 2

lynn

Reputation: 10814

This will accept < inside a value, but not </, which might be stricter and hence closer to what you want:

<\d+>(<[^/]|[^<])+?</\d+>
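
A quick sketch for trying this pattern, in Python for illustration. The `sample` and `tricky` strings are hypothetical inputs made up for the example.

```python
import re

# The pattern from this answer: '<' is accepted inside a value
# only together with a following character that is not '/'.
STRICT = re.compile(r"<\d+>(<[^/]|[^<])+?</\d+>")

# Shortened sample input (an assumption) with a raw '<' inside two values.
sample = (
    "000010<001>0005</001><002>03</002><003>2014<0813</003>"
    "<004>194642</004><007></007><009>2014<0901</009><010>ENSK</010>"
)
matches = [m.group(0) for m in STRICT.finditer(sample)]

# Edge case: a value that ends in '<'. The "<[^/]" branch consumes "<<",
# so the real closing tag is stepped over and the match runs on
# into the next element.
tricky = "<001>a<</001><002>b</002>"
tricky_matches = [m.group(0) for m in STRICT.finditer(tricky)]

print(matches)
print(tricky_matches)
```

If values can end with a `<`, that edge case is worth testing against the real data before relying on the pattern.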

Upvotes: 1
