Reputation: 192
I have a class in my app that is not transforming my XML data as expected.
Below is an excerpt of the XML. The file size can be between 2 GB and 3 GB, and the data is a representation of Mutual Funds. Each Fund usually has managers associated with it, but it's possible that there are none listed. A Fund in the data can have multiple ManagerDetail nodes or can have no ManagerDetail nodes. Each manager can have multiple CollegeEducation nodes or no CollegeEducation nodes.
<MutualFund>
  <ManagerList>
    <ManagerDetail>
      <ManagerRole>M</ManagerRole>
      <ManagerId>7394</ManagerId>
      <ManagerTenure>3.67</ManagerTenure>
      <StartDate>2011-09-30</StartDate>
      <OwnershipLevel>6</OwnershipLevel>
      <GivenName>Stephen</GivenName>
      <MiddleName>M.</MiddleName>
      <FamilyName>Kane</FamilyName>
      <Gender>M</Gender>
      <Certifications>
        <CertificationName>CFA</CertificationName>
      </Certifications>
      <CollegeEducations>
        <CollegeEducation>
          <School>University of Chicago</School>
          <Year>1990</Year>
          <Degree>M.B.A.</Degree>
        </CollegeEducation>
        <CollegeEducation>
          <School>University of California - Berkeley</School>
          <Year>1985</Year>
          <Degree>B.S.</Degree>
          <Major>Business</Major>
        </CollegeEducation>
      </CollegeEducations>
    </ManagerDetail>
  </ManagerList>
</MutualFund>
I've created a class that is called within a BackgroundWorker instance in another form. This class places the above data into the following table:
public static DataTable dtManagersEducation = new DataTable();
dtManagersEducation.Columns.Add("ManagerId");
dtManagersEducation.Columns.Add("Institution");
dtManagersEducation.Columns.Add("DegreeType");
dtManagersEducation.Columns.Add("Emphasis");
dtManagersEducation.Columns.Add("Year");
The method that places the XML data is set up like this. Basically, I have certain points where DataRows are created and completed, and certain XML data is to be placed into the available row as the data is read.
public static void Read(MainForm mf, XmlReader xml)
{
    mainForm = mf;
    xmlReader = xml;
    while (xmlReader.Read() && mainForm.continueProcess)
    {
        if (xmlReader.Name == "CollegeEducation")
        {
            if (nodeIsElement())
            {
                drManagersEducation = dtManagersEducation.NewRow();
                drManagersEducation["ManagerId"] = currentManager.morningstarManagerId;
            }
            else if (nodeIsEndElement())
            {
                dtManagersEducation.Rows.Add(drManagersEducation);
                drManagersEducation = null;
            }
        }
        else if (xmlReader.Name == "School")
        {
            if (nodeIsElement() && drManagersEducation != null)
            {
                string value = xmlReader.ReadElementContentAsString();
                drManagersEducation["Institution"] = value;
            }
        }
        else if (xmlReader.Name == "Year")
        {
            if (nodeIsElement() && drManagersEducation != null)
            {
                string value = xmlReader.ReadElementContentAsString();
                drManagersEducation["Year"] = value;
            }
        }
        else if (xmlReader.Name == "Degree")
        {
            if (nodeIsElement() && drManagersEducation != null)
            {
                string value = xmlReader.ReadElementContentAsString();
                drManagersEducation["DegreeType"] = value;
            }
        }
        else if (xmlReader.Name == "Major")
        {
            if (nodeIsElement() && drManagersEducation != null)
            {
                string value = xmlReader.ReadElementContentAsString();
                drManagersEducation["Emphasis"] = value;
            }
        }
    }
}

private static bool nodeIsElement()
{
    return xmlReader.NodeType == XmlNodeType.Element;
}

private static bool nodeIsEndElement()
{
    return xmlReader.NodeType == XmlNodeType.EndElement;
}
The resulting table has no data in the Emphasis or Year columns, even though, as you can see above, plenty of records have values for these fields.
ManagerId  Institution      DegreeType  Emphasis  Year
5807       Yale University  M.S.
9336       Yale University
7227       Yale University  M.S.
Would you all happen to have some insight into what is going on?
Thanks
My sample XML data listed above is indented, but the actual data that I was running through the XmlReader was not. As dbc has shown below, adding a bool variable readNext fixed my issue. As I understand it, readNext is set to false whenever ReadElementContentAsString() is called, and because my while loop condition is now (!readNext || xmlReader.Read()), the XmlReader does not call Read() on that iteration. This prevents ReadElementContentAsString() and Read() from being called one right after the other, so no data is skipped.
Thanks to dbc!
Upvotes: 2
Views: 187
Reputation: 117086
The problem you are seeing is that the method XmlReader.ReadElementContentAsString moves the reader past the end element tag. If you then call xmlReader.Read() unconditionally right afterwards, the node immediately after the end element tag will be skipped. In the XML shown in your question, the node immediately after each end element tag is whitespace, so the bug isn't reproducible with the XML as posted. But if I strip the indentation (and hopefully your 2+ GB XML file has no indentation), the bug becomes reproducible.
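The skipping can be reproduced in isolation with a small sketch like the one below. The class SkipDemo, the method NaiveElementNames, and the two sample strings are hypothetical and exist only for this demonstration; the loop deliberately mirrors the unconditional Read() from the question, and the reader uses default XmlReaderSettings, which report whitespace nodes.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical demo class that reproduces the bug on purpose.
public static class SkipDemo
{
    // Returns the names of the child elements the naive loop actually visits.
    public static List<string> NaiveElementNames(string xml)
    {
        var seen = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            // Read() is called unconditionally, as in the question's loop.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name != "CollegeEducation")
                {
                    seen.Add(reader.Name);
                    // The reader is now positioned on the node after the end tag;
                    // the next Read() moves past that node without examining it.
                    reader.ReadElementContentAsString();
                }
            }
        }
        return seen;
    }
}
```

With the unindented input "<CollegeEducation><School>A</School><Year>1990</Year><Degree>B.S.</Degree><Major>Business</Major></CollegeEducation>" this returns only School and Degree, which matches the empty Year and Emphasis columns in the question. With a line break and indentation between the child elements, the skipped node is whitespace instead, and all four names come back.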
Also, in your question, I don't see where you actually read the <ManagerId>7394</ManagerId> element. Instead you just take the value from currentManager.morningstarManagerId (an undefined global variable). I reckon that's a typo in your question, and in your actual code you read this somewhere.
Here's a version of your method that fixes these problems and can be compiled and tested standalone:
public static DataTable Read(XmlReader xmlReader, Func<bool> continueProcess)
{
    DataTable dtManagersEducation = new DataTable();
    dtManagersEducation.TableName = "ManagersEducation";
    dtManagersEducation.Columns.Add("ManagerId");
    dtManagersEducation.Columns.Add("Institution");
    dtManagersEducation.Columns.Add("DegreeType");
    dtManagersEducation.Columns.Add("Emphasis");
    dtManagersEducation.Columns.Add("Year");

    bool inManagerDetail = false;
    string managerId = null;
    DataRow drManagersEducation = null;

    // When false, the reader is already positioned on an unprocessed node,
    // so the loop must handle it before calling Read() again.
    bool readNext = true;
    while ((!readNext || xmlReader.Read()) && continueProcess())
    {
        readNext = true;
        if (xmlReader.NodeType == XmlNodeType.Element)
        {
            if (!xmlReader.IsEmptyElement)
            {
                if (xmlReader.Name == "ManagerDetail")
                {
                    inManagerDetail = true;
                }
                else if (xmlReader.Name == "ManagerId")
                {
                    // ReadElementContentAsString() advances past the end tag.
                    var value = xmlReader.ReadElementContentAsString();
                    readNext = false;
                    if (inManagerDetail)
                        managerId = value;
                }
                else if (xmlReader.Name == "School")
                {
                    var value = xmlReader.ReadElementContentAsString();
                    readNext = false;
                    if (drManagersEducation != null)
                        drManagersEducation["Institution"] = value;
                }
                else if (xmlReader.Name == "Year")
                {
                    var value = xmlReader.ReadElementContentAsString();
                    readNext = false;
                    if (drManagersEducation != null)
                        drManagersEducation["Year"] = value;
                }
                else if (xmlReader.Name == "Degree")
                {
                    var value = xmlReader.ReadElementContentAsString();
                    readNext = false;
                    if (drManagersEducation != null)
                        drManagersEducation["DegreeType"] = value;
                }
                else if (xmlReader.Name == "Major")
                {
                    var value = xmlReader.ReadElementContentAsString();
                    readNext = false;
                    if (drManagersEducation != null)
                        drManagersEducation["Emphasis"] = value;
                }
                else if (xmlReader.Name == "CollegeEducation")
                {
                    // Start a new row only when a manager ID has been read.
                    if (managerId != null)
                    {
                        drManagersEducation = dtManagersEducation.NewRow();
                        drManagersEducation["ManagerId"] = managerId;
                    }
                }
            }
        }
        else if (xmlReader.NodeType == XmlNodeType.EndElement)
        {
            if (xmlReader.Name == "ManagerDetail")
            {
                inManagerDetail = false;
                managerId = null;
            }
            else if (xmlReader.Name == "CollegeEducation")
            {
                // Commit the completed row, if one was started.
                if (drManagersEducation != null)
                    dtManagersEducation.Rows.Add(drManagersEducation);
                drManagersEducation = null;
            }
        }
    }
    return dtManagersEducation;
}
Upvotes: 1