Reputation: 63
I'm making some huge XML files (several GB) with the help of XmlWriter and Linq2Xml. This files are of type:
<Table recCount="" recLength="">
<Rec recId="1">..</Rec>
<Rec recId="2">..</Rec>
..
<Rec recId="n">..</Rec>
</Table>
I don't know values for Table's recCount and recLength attributes until I write all the inner Rec nodes, so I have to write values to these attributes at the very end.
Right now I'm writing all the inner Rec nodes to a temp file, calculate Table's attributes' values and write everything the way I've shown above to a resulting file. (copying everything from the temp file with all the Rec nodes)
I'm wondering if there is a way to modify these attributes' values without writing stuff to another file (like I do it right now) or loading the whole document into memory (which is obviously not possible due to size of these files)?
Upvotes: 4
Views: 1489
Reputation: 111920
Heavily commented code. The basic idea is that in the first pass we write:
<?xml version="1.0" encoding="utf-8"?>
<Table recCount="$1" recLength="$2">
<!--Reserved space:++++++++++++++++-->
<Rec...
Then we go back to the beginning of the file and we rewrite the first three lines:
<?xml version="1.0" encoding="utf-8"?>
<Table recCount="1000" recLength="150">
<!--Reserved space:#############-->
The important "trick" here is that you can't "insert" into a file, you can only overwrite it. So we "reserve" some space for the digits (the Reserved space:#############.
comment. There are many many ways we could have done it... For example, in the first pass we could have:
<Table recCount=" " recLength=" ">
and then (xml-legal but ugly):
<Table recCount="1000 " recLength="150 ">
Or we could have appended the space after the >
of Table:
<Table recCount="" recLength="">
(there are 20 spaces after the >
)
Then:
<Table recCount="1000" recLength="150">
(now there are are 13 spaces after the >
)
Or we could have simply added the spaces without the <!-- -->
on a new line...
The code:
int maxRecCountLength = 10; // int.MaxValue.ToString().Length
int maxRecLengthLength = 10; // int.MaxValue.ToString().Length
int tokenLength = 4; // 4 == $1 + $2, see below what $1 and $2 are
// Note that the reserved space will be in the form +++++++++++++++++++
string reservedSpace = new string('+', maxRecCountLength + maxRecLengthLength - tokenLength);
// You have to manually open the FileStream
using (var fs = new FileStream("out.xml", FileMode.Create))
// and add a StreamWriter on top of it
using (var sw = new StreamWriter(fs, Encoding.UTF8, 4096, true))
{
// Here you write on your StreamWriter however you want.
// Note that recCount and recLength have a placeholder $1 and $2.
int recCount = 0;
int maxRecLength = 0;
using (var xw = XmlWriter.Create(sw))
{
xw.WriteWhitespace("\r\n");
xw.WriteStartElement("Table");
xw.WriteAttributeString("recCount", "$1");
xw.WriteAttributeString("recLength", "$2");
// You have to add some white space that will be
// partially replaced by the recCount and recLength value
xw.WriteWhitespace("\r\n");
xw.WriteComment("Reserved space:" + reservedSpace);
// <--------- BEGIN YOUR CODE
for (int i = 0; i < 100; i++)
{
xw.WriteWhitespace("\r\n");
xw.WriteStartElement("Rec");
string str = string.Format("Some number: {0}", i);
if (str.Length > maxRecLength)
{
maxRecLength = str.Length;
}
xw.WriteValue(str);
recCount++;
xw.WriteEndElement();
}
// <--------- END YOUR CODE
xw.WriteWhitespace("\r\n");
xw.WriteEndElement();
}
sw.Flush();
// Now we read the first lines to modify them (normally we will
// read three lines, the xml header, the <Table element and the
// <-- Reserved space:
fs.Position = 0;
var lines = new List<string>();
using (var sr = new StreamReader(fs, sw.Encoding, false, 4096, true))
{
while (true)
{
string str = sr.ReadLine();
lines.Add(str);
if (str.StartsWith("<Table"))
{
// We read the next line, the comment line
str = sr.ReadLine();
lines.Add(str);
break;
}
}
}
string strCount = XmlConvert.ToString(recCount);
string strMaxRecLength = XmlConvert.ToString(maxRecLength);
// We do some replaces for the tokens
int oldLen = lines[lines.Count - 2].Length;
lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$1\"", string.Format("=\"{0}\"", strCount));
lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$2\"", string.Format("=\"{0}\"", strMaxRecLength));
int newLen = lines[lines.Count - 2].Length;
// Remove spaces from reserved whitespace
lines[lines.Count - 1] = lines[lines.Count - 1].Replace(":" + reservedSpace, ":" + new string('#', reservedSpace.Length - newLen + oldLen));
// We move back to just after the UTF8/UTF16 preamble
fs.Position = sw.Encoding.GetPreamble().Length;
// And we rewrite the lines
foreach (string str in lines)
{
sw.Write(str);
sw.Write("\r\n");
}
}
Slower .NET 3.5 way
In .NET 3.5 the StreamReader
/StreamWriter
want to close the base FileStream
, so I have to reopen various times the file. This is a little little slower.
int maxRecCountLength = 10; // int.MaxValue.ToString().Length
int maxRecLengthLength = 10; // int.MaxValue.ToString().Length
int tokenLength = 4; // 4 == $1 + $2, see below what $1 and $2 are
// Note that the reserved space will be in the form +++++++++++++++++++
string reservedSpace = new string('+', maxRecCountLength + maxRecLengthLength - tokenLength);
string fileName = "out.xml";
int recCount = 0;
int maxRecLength = 0;
using (var sw = new StreamWriter(fileName))
{
// Here you write on your StreamWriter however you want.
// Note that recCount and recLength have a placeholder $1 and $2.
using (var xw = XmlWriter.Create(sw))
{
xw.WriteWhitespace("\r\n");
xw.WriteStartElement("Table");
xw.WriteAttributeString("recCount", "$1");
xw.WriteAttributeString("recLength", "$2");
// You have to add some white space that will be
// partially replaced by the recCount and recLength value
xw.WriteWhitespace("\r\n");
xw.WriteComment("Reserved space:" + reservedSpace);
// <--------- BEGIN YOUR CODE
for (int i = 0; i < 100; i++)
{
xw.WriteWhitespace("\r\n");
xw.WriteStartElement("Rec");
string str = string.Format("Some number: {0}", i);
if (str.Length > maxRecLength)
{
maxRecLength = str.Length;
}
xw.WriteValue(str);
recCount++;
xw.WriteEndElement();
}
// <--------- END YOUR CODE
xw.WriteWhitespace("\r\n");
xw.WriteEndElement();
}
}
var lines = new List<string>();
using (var sr = new StreamReader(fileName))
{
// Now we read the first lines to modify them (normally we will
// read three lines, the xml header, the <Table element and the
// <-- Reserved space:
while (true)
{
string str = sr.ReadLine();
lines.Add(str);
if (str.StartsWith("<Table"))
{
// We read the next line, the comment line
str = sr.ReadLine();
lines.Add(str);
break;
}
}
}
// We have to use the Stream overload of StreamWriter because
// we want to modify the text!
using (var fs = File.OpenWrite(fileName))
using (var sw = new StreamWriter(fs))
{
string strCount = XmlConvert.ToString(recCount);
string strMaxRecLength = XmlConvert.ToString(maxRecLength);
// We do some replaces for the tokens
int oldLen = lines[lines.Count - 2].Length;
lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$1\"", string.Format("=\"{0}\"", strCount));
lines[lines.Count - 2] = lines[lines.Count - 2].Replace("=\"$2\"", string.Format("=\"{0}\"", strMaxRecLength));
int newLen = lines[lines.Count - 2].Length;
// Remove spaces from reserved whitespace
lines[lines.Count - 1] = lines[lines.Count - 1].Replace(":" + reservedSpace, ":" + new string('#', reservedSpace.Length - newLen + oldLen));
// We move back to just after the UTF8/UTF16 preamble
sw.BaseStream.Position = sw.Encoding.GetPreamble().Length;
// And we rewrite the lines
foreach (string str in lines)
{
sw.Write(str);
sw.Write("\r\n");
}
}
Upvotes: 1
Reputation: 14251
Try to use the following approach.
You can set the default value to the attributes in the external xml schema.
When creating an xml document, you do not create these attributes. Here it is:
int count = 5;
int length = 42;
var writerSettings = new XmlWriterSettings { Indent = true };
using (var writer = XmlWriter.Create("data.xml", writerSettings))
{
writer.WriteStartElement("Table");
for (int i = 1; i <= count; i++)
{
writer.WriteStartElement("Rec");
writer.WriteAttributeString("recId", i.ToString());
writer.WriteString("..");
writer.WriteEndElement();
}
}
Thus, the xml looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Table>
<Rec recId="1">..</Rec>
<Rec recId="2">..</Rec>
<Rec recId="3">..</Rec>
<Rec recId="4">..</Rec>
<Rec recId="5">..</Rec>
</Table>
Now create an xml schema for this document, which will specify the default values to the desired attributes.
string ns = "http://www.w3.org/2001/XMLSchema";
using (var writer = XmlWriter.Create("data.xsd", writerSettings))
{
writer.WriteStartElement("xs", "schema", ns);
writer.WriteStartElement("xs", "element", ns);
writer.WriteAttributeString("name", "Table");
writer.WriteStartElement("xs", "complexType", ns);
writer.WriteStartElement("xs", "sequence", ns);
writer.WriteStartElement("xs", "any", ns);
writer.WriteAttributeString("processContents", "skip");
writer.WriteAttributeString("maxOccurs", "unbounded");
writer.WriteEndElement();
writer.WriteEndElement();
writer.WriteStartElement("xs", "attribute", ns);
writer.WriteAttributeString("name", "recCount");
writer.WriteAttributeString("default", count.ToString()); // <--
writer.WriteEndElement();
writer.WriteStartElement("xs", "attribute", ns);
writer.WriteAttributeString("name", "recLength");
writer.WriteAttributeString("default", length.ToString()); // <--
writer.WriteEndElement();
}
Or much easier to create a schema as following:
XNamespace xs = "http://www.w3.org/2001/XMLSchema";
var schema = new XElement(xs + "schema",
new XElement(xs + "element", new XAttribute("name", "Table"),
new XElement(xs + "complexType",
new XElement(xs + "sequence",
new XElement(xs + "any",
new XAttribute("processContents", "skip"),
new XAttribute("maxOccurs", "unbounded")
)
),
new XElement(xs + "attribute",
new XAttribute("name", "recCount"),
new XAttribute("default", count) // <--
),
new XElement(xs + "attribute",
new XAttribute("name", "recLength"),
new XAttribute("default", length) // <--
)
)
)
);
schema.Save("data.xsd");
Please note the writing of the variables count
and length
- there should be your data.
The resulting schema will look like this:
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="Table">
<xs:complexType>
<xs:sequence>
<xs:any processContents="skip" maxOccurs="unbounded" />
</xs:sequence>
<xs:attribute name="recCount" default="5" />
<xs:attribute name="recLength" default="42" />
</xs:complexType>
</xs:element>
</xs:schema>
Now, when reading an xml document, you must to add this schema - the default attribute values will be taken from it.
XElement xml;
var readerSettings = new XmlReaderSettings();
readerSettings.ValidationType = ValidationType.Schema; // <--
readerSettings.Schemas.Add("", "data.xsd"); // <--
using (var reader = XmlReader.Create("data.xml", readerSettings)) // <--
{
xml = XElement.Load(reader);
}
xml.Save(Console.Out);
Console.WriteLine();
The result:
<Table recCount="5" recLength="42">
<Rec recId="1">..</Rec>
<Rec recId="2">..</Rec>
<Rec recId="3">..</Rec>
<Rec recId="4">..</Rec>
<Rec recId="5">..</Rec>
</Table>
Upvotes: 1
Reputation: 62
I think the FileStream Class will be helpful to you. Take a look at the Read and Write methods.
Upvotes: 0
Reputation: 382
You could try to load the xml file into a dataset as it would be easier to compute your attributes that way. Plus the memory management is done by the DataSet layer. Why not give it a try and let us all know too of the results.
Upvotes: 0