Ben McCormack

Reputation: 33118

What regex could I use to extract a body of XML text from a body of unformatted text?

Let's say I have the following body of text:

Call me Ishmael. Some years ago- never mind how long precisely- having little 
or no money in my purse, and nothing particular to interest me on shore, I 
thought I would sail about a little and see the watery part of the world. It is  
<?xml version="1.0" encoding="utf-8"?>
<RootElement xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <ChildElement />
   <ChildElement />
</RootElement>
a way I have of driving off the spleen and regulating the circulation. Whenever  
I find myself growing grim about the mouth; whenever it is a damp, drizzly 
November in my soul; 

What regex could I use to return the XML embedded in the string?

NOTE: I can assume that <RootElement> and </RootElement> will always have the same name.

Upvotes: 0

Views: 1082

Answers (2)

Tim Pietzcker

Reputation: 336468

I understand that the root element will not always be called RootElement, so you can use

<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>

with RegexOptions.Singleline. This takes the first tag name after the opening <?xml ... ?> tag and captures everything up to the matching closing tag.

In C#:

// Needs "using System.Text.RegularExpressions;". Singleline lets '.' match newlines.
resultString = Regex.Match(subjectString, @"<\?xml[^>]+>\s*<\s*(\w+).*?<\s*/\s*\1>", RegexOptions.Singleline).Value;

Upvotes: 2

SLaks

Reputation: 888187

If you know that the root element will always be <RootElement ...> and that there will never be a nested <RootElement> tag, you can do it like this:

\<\?xml .+?\</RootElement\>

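This regex will lazily match all text between <?xml and </RootElement>. Because the XML spans several lines, apply it with RegexOptions.Singleline so that . also matches newlines.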
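
For illustration, here is a minimal C# sketch of how that pattern might be applied. The class name XmlExtractor, the method ExtractXml, and the rootName parameter are hypothetical names made up for this example, not part of the answer. It assumes the root element name is known in advance and never appears nested inside itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlExtractor
{
    // Hypothetical helper: returns the embedded XML, or null if nothing matches.
    // Assumes rootName is known and is never nested inside itself.
    public static string ExtractXml(string text, string rootName = "RootElement")
    {
        // Regex.Escape guards against regex metacharacters in the element name.
        string pattern = @"<\?xml .+?</" + Regex.Escape(rootName) + ">";

        // Singleline lets '.' match newlines, so the lazy .+? can span lines.
        Match match = Regex.Match(text, pattern, RegexOptions.Singleline);
        return match.Success ? match.Value : null;
    }
}
```

Calling XmlExtractor.ExtractXml(subjectString) should return the text from <?xml through the first </RootElement>. Since a regex cannot confirm that the result is well-formed, it may be worth passing the result to an XML parser such as XDocument.Parse before relying on it.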

Upvotes: 2
