Reputation: 2243
I have the following input string, which comes from a 10 MB text file. Sometimes there are \n characters and other values between the XML elements, sometimes not.
string data = "\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>\n<element1><sub><element>2</element></sub></element1>\n \n<element2><sub>3</sub></element2>\n \n<element2><sub>4</sub></element2>";
Now I want to split this string by the element1 and element2 XML nodes. The result in this case should be:
output[0] = "<element1 value=\"3\"><sub>1</sub></element1>";
output[1] = "<element1><sub><element>2</element></sub></element1>";
output[2] = "<element2><sub>3</sub></element2>";
output[3] = "<element2><sub>4</sub></element2>";
I've tried
string[] output = data.Split(new string[] { "<element1>", "<element2>" }, StringSplitOptions.None);
but it throws an OutOfMemoryException, and the delimiters are removed by the split.
and
XmlDocument xml = new XmlDocument();
xml.LoadXml("<root>"+data +"</root>");
throws an exception.
Is there an easy way to parse those XML elements out of my text file?
Upvotes: 0
Views: 755
Reputation: 4893
You will need to remove the XML declaration and then wrap the remaining content in a root node. After that, you can use XDocument to parse the string and select the elements you need.
string data = "\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>\n<element1><sub><element>2</element></sub></element1>\n \n<element2><sub>3</sub></element2>\n \n<element2><sub>4</sub></element2>";
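// Requires: using System.Collections.Generic; using System.Linq; using System.Xml.Linq;
// Trim surrounding whitespace and cut off the XML declaration
data = data.Trim();
var pos = data.IndexOf("?>") + 2;
data = string.Concat("<root>", data.Substring(pos), "</root>");
var xml = XDocument.Parse(data);
// Take only the direct children of <root> (element1, element2, ...);
// Descendants() would also return the nested <element> node.
var nodes = xml.Root.Elements().Where(e => e.Name.LocalName.StartsWith("element"));
// If you need the elements as a list of strings (DisableFormatting keeps each one on a single line):
var items = new List<string>();
foreach (var node in nodes)
{
    items.Add(node.ToString(SaveOptions.DisableFormatting));
}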
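As an alternative to wrapping everything in a root node, here is a minimal sketch that reads the input as an XML fragment with XmlReader, which may be easier on memory for a large file because it does not build a full document tree. This is a different technique from the code above, and the class and method names (FragmentSplitter, SplitTopLevelElements) are made up for illustration. It assumes the only thing that has to be removed is a single leading XML declaration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentSplitter
{
    // Hypothetical helper: returns the markup of each top-level element as a string.
    public static List<string> SplitTopLevelElements(string data)
    {
        // Assumption: at most one XML declaration, at the start of the input.
        // It must be removed because it is preceded by whitespace in the sample data.
        data = data.TrimStart();
        if (data.StartsWith("<?xml ", StringComparison.Ordinal))
        {
            int end = data.IndexOf("?>", StringComparison.Ordinal);
            if (end >= 0)
            {
                data = data.Substring(end + 2);
            }
        }

        // Fragment conformance allows several top-level elements without a single root.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var items = new List<string>();

        using (var reader = XmlReader.Create(new StringReader(data), settings))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadOuterXml returns the element with its children
                    // and moves the reader past the element's end tag.
                    items.Add(reader.ReadOuterXml());
                }
                else
                {
                    // Skip whitespace and other nodes between the elements.
                    reader.Read();
                }
            }
        }

        return items;
    }
}
```

Calling FragmentSplitter.SplitTopLevelElements(data) with the string from the question should yield one entry per top-level element1/element2 node; the nested element node stays inside its parent's string.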
Upvotes: 3
Reputation: 59218
Wherever you get the invalid XML from: talk to whoever produces it and ask them to provide valid XML. Everything else is a hack and will break sooner or later.
The not recommended, hacky, and unstable version:
"<root>"+data +"</root>"
gives you the following XML
<root>
<?xml version="1.0" encoding="UTF-8"?>
<element1 value="3"><sub>1</sub></element1>
<element1><sub><element>2</element></sub></element1>
<element2><sub>3</sub></element2>
<element2><sub>4</sub></element2>
</root>
which is invalid because the <?xml ...?> declaration is only allowed at the very beginning of a document.
Remove it and it should work. Finding the first "?>" and removing everything up to and including it sounds quite safe to me. In real XML you'd have to consider multiple processing instructions like <?xml ...?> and <?xml-stylesheet ... ?>.
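The steps described above could be sketched roughly as follows. This is only an illustration under assumptions, not a robust parser: the RootWrapper class and its Split method are hypothetical names, and the regular expression assumes a single leading declaration of the form <?xml ...?>. Because it requires whitespace right after "xml", it leaves targets such as <?xml-stylesheet ...?> alone.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class RootWrapper
{
    // Matches only a leading XML declaration. The \s after "xml" keeps the
    // pattern from matching other targets such as <?xml-stylesheet ...?>.
    private static readonly Regex Declaration =
        new Regex(@"^\s*<\?xml\s.*?\?>", RegexOptions.Singleline);

    // Hypothetical helper: strips the declaration, wraps the rest in <root>,
    // and returns each direct child of <root> as a single-line string.
    public static string[] Split(string data)
    {
        // Replace at most one match, anchored at the start of the input.
        string body = Declaration.Replace(data, string.Empty, 1);

        XDocument doc = XDocument.Parse("<root>" + body + "</root>");

        return doc.Root.Elements()
                  .Select(e => e.ToString(SaveOptions.DisableFormatting))
                  .ToArray();
    }
}
```

For the string in the question, RootWrapper.Split(data) gives four strings, the first being `<element1 value="3"><sub>1</sub></element1>`. The whole input is still loaded into memory, so this remains the hacky version, just with the declaration handled a little more carefully.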
Upvotes: 2