Reputation: 2243
I have the following input string, which comes from a 10MB text file:
string data = "0x52341\n0x52341<?xml version=\"1.0\" encoding=\"UTF-8\"?><element1 value=\"3\"><sub>1</sub></element1>0x52341\n0x52341 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";
Now I want to split this string by the element1 and element2 XML nodes.
The result in this case should be:
output[0] = "<element1 value=\"3\"><sub>1</sub></element1>";
output[1] = "<element1><sub><element>2</element></sub></element1>";
output[2] = "<element2><sub>3</sub></element2>";
output[3] = "<element2><sub>4</sub></element2>";
My effort:
I have tried regular expressions, but they are very slow on a file that big. I have also tried
string[] output= input.Split(new string[] { "<element1>", "<element2>" }, StringSplitOptions.None);
string.Split() is impractical because it throws OutOfMemoryException and removes the delimiter when splitting.
Question: is there an easy way to parse those XML elements out of my text file?
Update: I simplified my file because I couldn't post the whole 10MB file on SO. Sometimes there are 0x1234 values between the XML elements, sometimes not.
Upvotes: 0
Views: 5026
Reputation: 22074
This processes the file as a stream: it looks for the opening and closing tags and parses only the elements of interest as they are found:
// Requires: using System; using System.IO; using System.Text; using System.Xml.Linq;
using (var stream = File.OpenRead("..."))
{
    StringBuilder builder = null; // collects the current tag, from '<' to '>'
    StringBuilder xml = null;     // collects the element currently being captured
    using (var reader = new StreamReader(stream, Encoding.UTF8))
    {
        while (!reader.EndOfStream)
        {
            char c = (char)reader.Read();
            if (c == '<' && builder == null)
            {
                builder = new StringBuilder();
            }
            if (builder != null)
            {
                builder.Append(c);
            }
            if (xml != null)
            {
                xml.Append(c);
            }
            // Guard against a stray '>' outside any tag, where builder is still null.
            if (c == '>' && builder != null)
            {
                var token = builder.ToString();
                if (xml == null)
                {
                    // Opening tag of an element of interest: start capturing.
                    if (token.StartsWith("<element1", StringComparison.Ordinal) || token.StartsWith("<element2", StringComparison.Ordinal))
                    {
                        xml = new StringBuilder("<?xml version='1.0' encoding='utf-8' ?>");
                        xml.Append(token);
                    }
                }
                else
                {
                    // Closing tag: the captured text is a complete element, so parse it.
                    if (token.StartsWith("</element1", StringComparison.Ordinal) || token.StartsWith("</element2", StringComparison.Ordinal))
                    {
                        XElement element = XElement.Parse(xml.ToString());
                        // do something with the element
                        xml = null;
                    }
                }
                builder = null;
            }
        }
    }
}
Upvotes: 0
Reputation: 5566
EDIT: A faster alternative (as it does not use Regex), which also leaves any 0x... fragments inside the content of the elements untouched, would be the following:
string data = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";
XmlReaderSettings xrs = new XmlReaderSettings();
xrs.ConformanceLevel = ConformanceLevel.Fragment;
XDocument doc = new XDocument(new XElement("root"));
XElement root = doc.Descendants().First();
using (var ms = new StreamWriter(new MemoryStream()))
{
    ms.Write(data);
    ms.Flush();
    ms.BaseStream.Position = 0;
    using (StreamReader fs = new StreamReader(ms.BaseStream))
    //using (StreamReader fs = new StreamReader("file.xml"))
    {
        using (XmlReader rdr = XmlReader.Create(fs, xrs))
        {
            while (rdr.Read())
            {
                if (rdr.NodeType == XmlNodeType.Element)
                {
                    root.Add(XElement.Load(rdr.ReadSubtree()));
                }
            }
        }
    }
}
You could also read directly from the file by using another StreamReader constructor (and removing the StreamWriter part).
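As a rough sketch of that file-based variant, the reader can be wrapped in an iterator that yields one element at a time, so the elements do not all have to be held in a root node. The class name FragmentReader and the method ReadTopLevelElements are made up for this illustration. The sketch uses XNode.ReadFrom instead of ReadSubtree, and it assumes the input contains no XML declaration in the middle of the text, since the reader would reject that.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

static class FragmentReader
{
    // Hypothetical helper: lazily yields every top-level element found in the input.
    public static IEnumerable<XElement> ReadTopLevelElements(TextReader input)
    {
        var settings = new XmlReaderSettings { ConformanceLevel = ConformanceLevel.Fragment };
        using (XmlReader reader = XmlReader.Create(input, settings))
        {
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element)
                {
                    // ReadFrom consumes the whole element and leaves the reader
                    // on the node that follows it, so no extra Read() is needed here.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    // Skip top-level text such as the 0x... values and whitespace.
                    reader.Read();
                }
            }
        }
    }
}
```

It could then be called with something like `foreach (XElement e in FragmentReader.ReadTopLevelElements(new StreamReader("file.xml"))) { ... }`, where "file.xml" is the placeholder file name from the commented-out line above.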
Upvotes: 1
Reputation: 855
Here is a console app that will do it:
using System;
using System.Collections.Generic;

class Program
{
    static void Main(string[] args)
    {
        string source = "0x52341<element1 value=\"3\"><sub>1</sub></element1>0x234512 <element1><sub><element>2</element></sub></element1>0x52341<element2><sub>3</sub></element2> <element2><sub>4</sub></element2>0x4312";
        List<string> components = new List<string>();
        while (source.Length > 0)
        {
            // Each component runs from the next '<' up to the next "0x" marker.
            int start = source.IndexOf('<');
            if (-1 == start)
                break;
            int next = source.IndexOf("0x", start, StringComparison.OrdinalIgnoreCase);
            if (-1 == next)
            {
                // No trailing marker: keep the remainder instead of dropping it.
                components.Add(source.Substring(start));
                break;
            }
            components.Add(source.Substring(start, next - start));
            source = source.Substring(next);
        }
        foreach (string s in components)
            Console.WriteLine(s);
        Console.ReadLine();
    }
}
Try that out.
Upvotes: 0
Reputation: 115829
If you can guarantee that each <elementX></elementX> fragment is a well-formed XML node (so to speak), wrap the entire string in <elements> ... </elements> and deal with it using standard .NET approaches, be it XmlDocument, LINQ to XML or whatever else suits you.
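A minimal sketch of that idea with LINQ to XML might look like the following. The class name WrapExample and the method ExtractElements are invented for this illustration. It assumes the text contains no XML declaration (a declaration in the middle of the wrapped string would make parsing fail) and that the text between the elements has no unescaped < or & characters.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

static class WrapExample
{
    // Hypothetical helper: wraps the raw text in a single root element and
    // returns the markup of each direct child element, ignoring the text between them.
    public static List<string> ExtractElements(string data)
    {
        XElement root = XElement.Parse("<elements>" + data + "</elements>");
        var result = new List<string>();
        foreach (XElement child in root.Elements())
        {
            // DisableFormatting keeps each element on one line instead of indenting it.
            result.Add(child.ToString(SaveOptions.DisableFormatting));
        }
        return result;
    }

    static void Main()
    {
        string data = "0x52341<element1 value=\"3\"><sub>1</sub></element1>0x52341 <element2><sub>3</sub></element2>0x4312";
        foreach (string s in ExtractElements(data))
            Console.WriteLine(s);
    }
}
```

This loads the whole string into memory at once, so for a 10MB file it trades memory for simplicity.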
Upvotes: 3