Byyo

Reputation: 2243

Split String by XML elements

I have the following input string, which comes from a 10 MB text file:

string data = "0x52341\n0x52341<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>0x52341\n0x52341 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";

Now I want to split this string by its element1 and element2 XML nodes.

The result in this case should be:

output[0] = "<element1 value=\"3\"><sub>1</sub></element1>";
output[1] = "<element1><sub><element>2</element></sub></element1>";
output[2] = "<element2><sub>3</sub></element2>";
output[3] = "<element2><sub>4</sub></element2>";

My effort:

I have tried a regular expression, but that is very slow on a file this big. I have also tried

string[] output = data.Split(new string[] { "<element1>", "<element2>" }, StringSplitOptions.None);

string.Split() is not workable here because it throws an OutOfMemoryException and the delimiters are removed from the results.

Question: Is there an easy way to parse those XML elements out of my text file?

Update: I simplified my file because I couldn't post the whole 10 MB file on SO. Sometimes there are 0x1234 values between the XML elements, sometimes not.

Upvotes: 0

Views: 5026

Answers (4)

Ondrej Svejdar

Reputation: 22074

This processes the file as a stream: it looks for the opening and closing tags of element1 and element2 and parses only those elements as it goes:

  using (var stream = File.OpenRead("..."))
  {
    StringBuilder builder = null;
    StringBuilder xml = null;
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
      while (!reader.EndOfStream)
      {
        char c = (char)reader.Read();
        if (c == '<' && builder == null)
        {
          builder = new StringBuilder();
        }
        if (builder != null)
        {
          builder.Append(c);
        }
        if (xml != null)
        {
          xml.Append(c);
        }

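        // Ignore a stray '>' outside any tag; builder is null there and ToString() would throw.
        if (c == '>' && builder != null)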
        {
          var token = builder.ToString();
          if (xml == null)
          {
            if (token.StartsWith("<element1", StringComparison.Ordinal) || token.StartsWith("<element2", StringComparison.Ordinal))
            {
              xml = new StringBuilder("<?xml version='1.0' encoding='utf-8' ?>");
              xml.Append(token);
            }
          }
          else
          {
            if (token.StartsWith("</element1", StringComparison.Ordinal) || token.StartsWith("</element2", StringComparison.Ordinal))
            {
              XElement element = XElement.Parse(xml.ToString());
              // do something with the element
              xml = null;
            }
          }
          builder = null;
        }
      }
    }
  }

Upvotes: 0

fixagon

Reputation: 5566

EDIT: A faster alternative (as it's not using Regex), which also does not replace 0x... fragments within the content of the elements, would be the following:

string data = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";

XmlReaderSettings xrs = new XmlReaderSettings();
xrs.ConformanceLevel = ConformanceLevel.Fragment;
XDocument doc = new XDocument(new XElement("root"));
XElement root = doc.Descendants().First();

using(var ms = new StreamWriter(new MemoryStream()))
{
    ms.Write(data);
    ms.Flush();
    ms.BaseStream.Position = 0;
    using (StreamReader fs = new StreamReader(ms.BaseStream))
    //using (StreamReader fs = new StreamReader("file.xml"))
    {
        using (XmlReader rdr = XmlReader.Create(fs, xrs))
        {
            while (rdr.Read())
            {
                if (rdr.NodeType == XmlNodeType.Element)
                {
                    root.Add(XElement.Load(rdr.ReadSubtree()));
                }
            }
        }
    }
}

You could also read directly from the file by using the StreamReader constructor that takes a path (and removing the StreamWriter part).
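Reading straight from the file could look roughly like the sketch below. It yields each top-level element as it is read instead of collecting everything under one root, which may help with a large file. The class name FragmentFile, the method name ReadElements and the path parameter are illustrative, not part of the code above. It assumes the file contains no <?xml ... ?> declaration anywhere but at the very start, and no raw < or & in the text between elements; otherwise the reader throws.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class FragmentFile
{
    // Hypothetical helper: streams the top-level elements of a fragment file.
    public static IEnumerable<XElement> ReadElements(string path)
    {
        var settings = new XmlReaderSettings
        {
            // Allows several root-level elements with text between them.
            ConformanceLevel = ConformanceLevel.Fragment
        };

        using (var file = new StreamReader(path))
        using (var reader = XmlReader.Create(file, settings))
        {
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadFrom consumes the whole element and leaves the
                    // reader on the node that follows it, so no extra Read().
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    // Skip text such as 0x52341 and whitespace between elements.
                    reader.Read();
                }
            }
        }
    }
}
```

A caller would then filter by name, for example with `FragmentFile.ReadElements("file.xml").Where(e => e.Name.LocalName == "element1")` (which needs `using System.Linq;`).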

Upvotes: 1

Lorek

Reputation: 855

Here is a console app that will do it:

class Program
{
    static void Main(string[] args)
    {
        string source = "0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";
        List<string> components = new List<string>();
        while (source.Length > 0)
        {
            int start = source.IndexOf('<');
            if (-1 == start)
                break;
            int next = source.IndexOf("0x", start, StringComparison.OrdinalIgnoreCase);
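            if (-1 == next)
            {
                // No trailing 0x marker: keep the remainder instead of dropping it.
                components.Add(source.Substring(start));
                break;
            }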
            components.Add(source.Substring(start, next - start));
            source = source.Substring(next);
        }
        foreach (string s in components)
            Console.WriteLine(s);
        Console.ReadLine();
    }
}

Try that out.

Upvotes: 0

Anton Gogolev

Reputation: 115829

If you can guarantee that each <elementX></elementX> fragment is a well-formed XML node (so to speak), wrap the entire string in <elements> ... </elements> and deal with it using standard .NET approaches, be it XmlDocument, LINQ to XML or whatever else fits you.
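A minimal sketch of the wrapping idea with LINQ to XML, using the sample string from the question. The class name WrapAndParse is illustrative. Two assumptions are baked in: the embedded declaration always has exactly the form shown (it has to be removed, since an XML declaration is only allowed at the very start of a document), and the text between elements contains no raw < or &. Note that this loads the whole input into memory at once.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class WrapAndParse
{
    static void Main()
    {
        // Sample input from the question.
        string data = "0x52341\n0x52341<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>0x52341\n0x52341 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";

        // Assumption: the declaration always looks exactly like this.
        data = data.Replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "");

        // The 0x... values become plain text nodes under the artificial root.
        XElement root = XElement.Parse("<elements>" + data + "</elements>");

        // Keep only the direct children named element1 or element2.
        string[] output = root.Elements()
            .Where(e => e.Name.LocalName == "element1" || e.Name.LocalName == "element2")
            .Select(e => e.ToString(SaveOptions.DisableFormatting))
            .ToArray();

        foreach (string s in output)
            Console.WriteLine(s);
    }
}
```

SaveOptions.DisableFormatting keeps each element on a single line rather than indenting it.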

Upvotes: 3
