Reputation: 31
I'm having trouble coming up with a regular expression. I have some text that may be split up by XML tags. For example:
<root>
<text>Thi</text>
<text>s is ju</text>
<text><bold>s</bold></text>
<text>t a tes</text>
<text><italic>t</italic></text>
</root>
I want to search for the word "just" in the XML and need this result:
ju</text>
<text><bold>s</bold></text>
<text>t
Is there any way to get this result with a regular expression?
By the way, I already have a regular expression that extracts the plain text from the XML. In C# syntax:
string plaintext = new Regex(@"\<[^\<]*\>").Replace(xmlstring, string.Empty);
This matches everything from a "<" to the next ">", as long as there is no other "<" in between, and replaces it with string.Empty. That gives me the plain text, which I could search for "just", but the result would only be "just", without the XML in between.
Does anybody have an idea?
Upvotes: 2
Views: 126
Reputation: 12807
If your XML is on a single line (no whitespace between tags), you can build the regex by separating the letters of "just" with
(?:<[^>]*>)*
which matches any run of tags. Example:
j(?:<[^>]*>)*u(?:<[^>]*>)*s(?:<[^>]*>)*t
If you also need to handle multiline XML, separate the letters with (?! )(?:<[^>]*>\s*)*(?<! )
instead. This allows whitespace after each XML tag, but does not allow a space directly before or after a letter.
j(?! )(?:<[^>]*>\s*)*(?<! )u(?! )(?:<[^>]*>\s*)*(?<! )s(?! )(?:<[^>]*>\s*)*(?<! )t
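Writing the pattern out by hand gets tedious for longer words, so here is a minimal C# sketch that builds it from the search word. The class TagTolerantSearch and the methods BuildRegex and FindInXml are hypothetical names chosen for illustration; only the separator pattern comes from this answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TagTolerantSearch
{
    // Separator from the answer: any run of tags, each optionally followed by
    // whitespace, with lookarounds that reject a literal space next to a letter.
    private const string Separator = @"(?! )(?:<[^>]*>\s*)*(?<! )";

    // Hypothetical helper: escapes every character of the word and joins
    // the escaped characters with the separator pattern.
    public static Regex BuildRegex(string word)
    {
        var letters = word.Select(c => Regex.Escape(c.ToString()));
        return new Regex(string.Join(Separator, letters));
    }

    // Returns the matched slice of the XML, or an empty string if there is no match.
    public static string FindInXml(string xml, string word)
    {
        Match m = BuildRegex(word).Match(xml);
        return m.Success ? m.Value : string.Empty;
    }
}
```

With the XML from the question, TagTolerantSearch.FindInXml(xml, "just") should return the three-line slice starting at "ju</text>" and ending at "<text>t". Note that the regex runs over the raw XML, so a search word that also appears inside a tag name (for example "text") can match the tag itself.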
Upvotes: 1
Reputation: 7534
It's better not to use regex on XML. Just don't.
For your task, any number of XML tags can follow each character of the string you are looking for. So you basically need to insert a 'maybetag' regex part after each letter, something like this:
j(\<[^\<]*?\>\s*)*u(\<[^\<]*?\>\s*)*s(\<[^\<]*?\>\s*)*t(\<[^\<]*?\>\s*)*
Working sample: http://www.rexfiddle.net/WdkpliZ
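One detail worth checking is the 'maybetag' group after the last letter: it also consumes any tags that directly follow the word. The C# sketch below compares the pattern shape used above with a variant that drops the final group, using the word "test" because in the question's sample it is directly followed by closing tags. The class and method names are hypothetical, and the variant without the trailing group is an assumption about what the asker wants, not part of the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TrailingTagDemo
{
    // The 'maybetag' part from the answer: zero or more tags,
    // each optionally followed by whitespace.
    private const string MaybeTag = @"(\<[^\<]*?\>\s*)*";

    // Same shape as the answer's pattern: a 'maybetag' group after every letter,
    // including the last one.
    public static string WithTrailingGroup(string xml)
    {
        string pattern = "t" + MaybeTag + "e" + MaybeTag + "s" + MaybeTag + "t" + MaybeTag;
        return Regex.Match(xml, pattern).Value;
    }

    // Assumed variant: no group after the last letter, so the match ends at that letter.
    public static string WithoutTrailingGroup(string xml)
    {
        string pattern = "t" + MaybeTag + "e" + MaybeTag + "s" + MaybeTag + "t";
        return Regex.Match(xml, pattern).Value;
    }
}
```

On the question's XML, WithTrailingGroup should run on through "</italic></text>" and "</root>", while WithoutTrailingGroup should stop at the final "t". For "just" the two behave the same, because the "t" is followed by plain text.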
Upvotes: 1