Reputation: 508
I am trying to write an XML parser to parse all courses that a university offers, given a calendar year and semester. In particular, I am trying to get the department acronym (e.g. FIN for Finance), the course number (e.g. in Math 415, 415 is the number), the course name, and the number of credit hours the course is worth.
The file I am trying to parse can be found HERE
EDIT AND UPDATE
Upon reading deeper into XML parsing and the best way to optimize it, I stumbled upon this blog POST
Assuming the results of the tests run in that article are honest and accurate, it seems that XmlReader far outperforms both XDocument and XmlDocument, which confirms what is said in the great answers below. Having said that, I re-coded my parser class using XmlReader, limiting the number of readers used in any single method.
Here is the new parser class:
public void ParseDepartments()
{
    // Create a reader for the given calendar year and semester xml file
    using (XmlReader reader = XmlReader.Create(xmlPath)) {
        reader.ReadToFollowing("subjects"); // Navigate to the element 'subjects'
        while (!reader.EOF) {
            string pth = reader.GetAttribute("href"); // Get the department's xml path
            string acro = reader.GetAttribute("id");  // Get the department's acronym
            reader.Read(); // Advance the reader so every element is visited
            if (!string.IsNullOrEmpty(acro)) { // If the acronym is valid, add it to the department list
                deps.AddDepartment(acro, pth);
            }
        }
    }
}
public void ParseDepCourses()
{
    // Loop through all the departments and visit their respective xml files
    foreach (KeyValuePair<string, string> department in deps.DepartmentPaths) {
        try {
            using (XmlReader reader = XmlReader.Create(department.Value)) {
                reader.ReadToFollowing("courses"); // Navigate to the element 'courses'
                while (!reader.EOF) {
                    string pth = reader.GetAttribute("href");
                    string num = reader.GetAttribute("id");
                    reader.Read();
                    if (!string.IsNullOrEmpty(num)) {
                        string crseName = reader.Value; // reader.Value is the element's text, i.e. <elementTag>Value</elementTag>
                        deps[department.Key].Add(new CourseObject(num, crseName, termID, pth)); // Add the course to the department's course list
                    }
                }
            }
        } catch (WebException) { } // A WebException (404) is thrown when no xml file is found, i.e. the department has no courses
    }
}
public void ParseCourseInformation()
{
    // A regular expression that checks each section title to determine whether it is a 'Lecture'
    // section; if so, that section's xml file is visited and its instructor added
    Regex expr = new Regex(@"^\S(L*)\d\b|^\S(L*)\b|^\S\d\b|^\S\b", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
    foreach (KeyValuePair<string, Collection<CourseObject>> pair in deps) {
        foreach (CourseObject crse in pair.Value) {
            try {
                using (XmlReader reader = XmlReader.Create(crse.XmlPath)) {
                    reader.ReadToFollowing("creditHours"); // Get the credit hours for the course
                    crse.ParseCreditHours(reader.Value); // Class method that parses the string and grabs the correct integer values
                    reader.ReadToFollowing("sections"); // Navigate to the element 'sections'
                    while (!reader.EOF) {
                        string pth = reader.GetAttribute("href");
                        string crn = reader.GetAttribute("id");
                        reader.Read();
                        if (!string.IsNullOrEmpty(crn)) {
                            string sction = reader.Value;
                            if (expr.IsMatch(sction)) { // Check whether sction is a 'Lecture' section
                                using (XmlReader reader2 = XmlReader.Create(pth)) { // Open its xml file
                                    reader2.ReadToFollowing("instructors"); // Navigate to the element 'instructors'
                                    while (!reader2.EOF) {
                                        string firstName = reader2.GetAttribute("firstName");
                                        string lastName = reader2.GetAttribute("lastName");
                                        reader2.Read();
                                        if (!string.IsNullOrEmpty(firstName) && !string.IsNullOrEmpty(lastName)) { // Make sure it's a valid name
                                            string instr = firstName + ". " + lastName; // Concatenate into a full name
                                            crse.AddSection(pth, sction, crn, instr); // Add the section to the course
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            } catch (WebException) { } // No course/section information found
        }
    }
}
Although this code takes quite some time to execute (anywhere between 10 and 30 minutes), that is expected given the large amount of data being parsed. Thanks to everyone who posted answers, it was much appreciated. I hope this helps other people who have similar problems/questions.
Thanks,
David
Upvotes: 0
Views: 2084
Reputation: 50336
Well, apparently loading the XML files is somehow slow (e.g. because they are big or because of the time it takes to download them), and using XDocument will load and parse them wholly into memory even if you're only using a small portion of them. Doing this recursively, three levels deep, will make the whole process very slow, but eventually it will end (either as expected or with an OutOfMemoryException).1
Take a look at the XmlReader class. It allows you to read through an XML file sequentially, picking out whatever you need and aborting the read as soon as you have all the information you require. However, it works quite a bit differently from XDocument and is a lot less intuitive. There are some examples on that MSDN page, and on Stack Overflow too.
As a side note: when using XmlReader, consider reading and closing one reader before you start reading another XML file. This keeps the memory footprint of your application minimal.
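To illustrate the sequential style, here is a minimal sketch that pulls attributes out of a stream without materializing the whole document; the element and attribute names (subject, id, href) are modeled on the asker's code, not the real feed, and the XML is inlined as a string for brevity:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class SubjectScanner
{
    // Walk the stream forward element by element, collecting only the
    // attributes we care about; nothing else is kept in memory
    public static List<string> ReadSubjects(TextReader source)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.ReadToFollowing("subject")) // Advance to the next 'subject' element
            {
                string acro = reader.GetAttribute("id");
                string path = reader.GetAttribute("href");
                results.Add(acro + " -> " + path);
            }
        } // Disposing the reader frees its buffers right away
        return results;
    }

    static void Main()
    {
        // A tiny stand-in for one of the university's files
        string xml = "<subjects>" +
                     "<subject id='FIN' href='fin.xml'>Finance</subject>" +
                     "<subject id='MATH' href='math.xml'>Mathematics</subject>" +
                     "</subjects>";
        foreach (string line in SubjectScanner.ReadSubjects(new StringReader(xml)))
            Console.WriteLine(line); // FIN -> fin.xml, then MATH -> math.xml
    }
}
```

Because the reader only moves forward, you can break out of the loop the moment you have what you need, which is exactly what makes it cheaper than parsing the whole file.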
1) Consider, for example, that you're reading files for 10 years of 3 semesters of 60 courses; then your code is downloading, parsing, verifying and processing 10 * 3 * 60 = 1800 files. And it needs to download them over the (compared to your local PC) slow internet. Don't expect this whole process to be quick.
Upvotes: 3
Reputation: 726839
The loop does not become infinite, it simply becomes very, very slow. This is because the call of
XDocument hoursDoc = XDocument.Load(crsePath);
opens another XML file and parses it. Considering that the processing takes 25 seconds when all the information is in memory, it is not surprising that opening an additional file for each course you encounter slows the process to a crawl.
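If some of those paths repeat, one way to soften the cost is to cache each parsed document so any given file is loaded at most once. This is a sketch under that assumption, not the asker's actual code; the class name is made up:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

class DocumentCache
{
    // Remember every document we have already parsed, keyed by its path
    private readonly Dictionary<string, XDocument> cache =
        new Dictionary<string, XDocument>(StringComparer.OrdinalIgnoreCase);

    public XDocument Load(string path)
    {
        XDocument doc;
        if (!cache.TryGetValue(path, out doc))
        {
            doc = XDocument.Load(path); // Only hit the network/disk on a cache miss
            cache[path] = doc;
        }
        return doc;
    }
}
```

Of course, this only helps when the same file is requested more than once; for files visited exactly once, switching to sequential reading with XmlReader is the bigger win.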
Upvotes: 1