Byyo

Reputation: 2243

Parsing multiple XML objects in one file

I have the following input string, which comes from a 10 MB text file. Sometimes there are \n characters and other values between the XML elements, sometimes not.

string data = "\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>\n<element1><sub><element>2</element></sub></element1>\n \n<element2><sub>3</sub></element2>\n \n<element2><sub>4</sub></element2>";

Now I want to split this string into its element1 and element2 XML nodes.

The result in this case should be:

output[0] = "<element1 value=\"3\"><sub>1</sub></element1>";
output[1] = "<element1><sub><element>2</element></sub></element1>";
output[2] = "<element2><sub>3</sub></element2>";
output[3] = "<element2><sub>4</sub></element2>";

I've tried

string[] output= input.Split(new string[] { "<element1>", "<element2>" }, StringSplitOptions.None);

but it throws an OutOfMemoryException, and the delimiters are removed by the split.

and

XmlDocument xml = new XmlDocument();
xml.LoadXml("<root>"+data +"</root>");

throws an exception.

Is there an easy way to parse those XML elements out of my text file?

Upvotes: 0

Views: 755

Answers (2)

loopedcode

Reputation: 4893

You need to remove the XML header (the <?xml ...?> declaration) and then wrap the remaining content in a root node. After that, you can use XDocument to parse it and select the elements you need.

    string data = "\n<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>\n<element1><sub><element>2</element></sub></element1>\n \n<element2><sub>3</sub></element2>\n \n<element2><sub>4</sub></element2>";

    //Trim whitespace and strip the XML header, if there is one
    data = data.Trim();
    var pos = data.StartsWith("<?xml", StringComparison.Ordinal) ? data.IndexOf("?>") + 2 : 0;
    data = string.Concat("<root>", data.Substring(pos), "</root>");

    //XDocument lives in System.Xml.Linq
    var xml = XDocument.Parse(data);

    //Direct children of <root> only (element1, element2, ...).
    //Descendants() would also return nested nodes such as <element>.
    var nodes = xml.Root.Elements();

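    //If you need the elements as a list of strings.
    //DisableFormatting keeps each element on one line instead of indenting it.
    var items = new List<string>();
    foreach(var node in nodes)
    {
        items.Add(node.ToString(SaveOptions.DisableFormatting));
    }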
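
If wrapping the whole string in a root node is undesirable, a possible alternative is to let XmlReader read the input as a fragment. This is only a sketch under assumptions: the class and method names (FragmentSplitter, SplitTopLevelElements) are made up for illustration, and the <?xml ...?> header is assumed to appear at most once, at the start of the data.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class FragmentSplitter
{
    // Hypothetical helper: returns the outer XML of every top-level element.
    public static List<string> SplitTopLevelElements(string data)
    {
        // Drop leading whitespace and the XML header, if present.
        data = data.TrimStart();
        if (data.StartsWith("<?xml", StringComparison.Ordinal))
        {
            data = data.Substring(data.IndexOf("?>", StringComparison.Ordinal) + 2);
        }

        // Fragment conformance allows several top-level elements
        // and text or whitespace between them.
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        var items = new List<string>();

        using (var reader = XmlReader.Create(new StringReader(data), settings))
        {
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadOuterXml returns the element markup and
                    // moves the reader past its end tag.
                    items.Add(reader.ReadOuterXml());
                }
                else
                {
                    // Skip whitespace or text between the elements.
                    reader.Read();
                }
            }
        }

        return items;
    }
}
```

Calling FragmentSplitter.SplitTopLevelElements(data) on the string from the question should yield one entry per top-level element. The loop avoids calling Read() right after ReadOuterXml(), because ReadOuterXml() has already advanced the reader and an extra Read() could skip an adjacent element.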

Upvotes: 3

Thomas Weller

Reputation: 59218

Wherever you get the invalid XML from: talk to whoever produces it and ask them to provide valid XML. Everything else is a hack and will break sooner or later.

The not-recommended, hacky and unstable version:

"<root>"+data +"</root>" gives you the following XML:

<root>
<?xml version="1.0" encoding="UTF-8"?>
    <element1 value="3"><sub>1</sub></element1>
    <element1><sub><element>2</element></sub></element1>
    <element2><sub>3</sub></element2>
    <element2><sub>4</sub></element2>
</root>

which is invalid because the <?xml ...?> declaration is only allowed at the very beginning of a document.

Remove the <?xml ...?> header and it should work. Finding the first "?>" and removing everything up to and including it sounds quite safe to me. In real XML you'd have to consider multiple processing instructions, such as <?xml ...?> and <?xml-stylesheet ... ?>.
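
As a rough sketch of that workaround, staying with the XmlDocument from the question: the class and method names (RootWrapper, Split) are made up for illustration, and the code assumes the only "?>" that needs removing is the one closing the header.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class RootWrapper
{
    // Hypothetical helper illustrating the hack described above.
    public static List<string> Split(string data)
    {
        // Remove everything up to and including the first "?>", if there is one.
        int end = data.IndexOf("?>", StringComparison.Ordinal);
        if (end >= 0)
        {
            data = data.Substring(end + 2);
        }

        // With the header gone, the wrapped string is well-formed XML.
        var xml = new XmlDocument();
        xml.LoadXml("<root>" + data + "</root>");

        var items = new List<string>();
        foreach (XmlNode node in xml.DocumentElement.ChildNodes)
        {
            // Keep the elements only; skip any text between them.
            if (node.NodeType == XmlNodeType.Element)
            {
                items.Add(node.OuterXml);
            }
        }

        return items;
    }
}
```

This still inherits the fragility mentioned above: a second header further down in the file, or a "?>" belonging to some other processing instruction, would make it fail or cut the wrong part.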

Upvotes: 2
