Reputation: 508
I am trying to write an XML parser to parse all courses that a university offers, given a calendar year and semester. In particular, I am trying to get the department acronym (e.g. FIN for Finance), the course number (e.g. in Math 415, 415 is the number), the course name, and the number of credit hours the course is worth.
The file I am trying to parse can be found HERE
EDIT AND UPDATE
Upon reading deeper into XML parsing and the best way to optimize it, I stumbled upon this blog POST
Assuming the results of the tests run in that article are honest and accurate, it seems that XmlReader far outperforms both XDocument and XmlDocument, which confirms what is said in the great answers below. Having said that, I re-coded my parser class using XmlReader, limiting the number of readers used in any single method.
Here is the new parser class:
public void ParseDepartments()
{
    // Create a reader for the given calendar year and semester xml file
    using (XmlReader reader = XmlReader.Create(xmlPath)) {
        reader.ReadToFollowing("subjects"); // Navigate to the element 'subjects'
        while (!reader.EOF) {
            string pth = reader.GetAttribute("href"); // Get the department's xml path
            string acro = reader.GetAttribute("id");  // Get the department's acronym
            reader.Read(); // Advance the reader so every element is visited
            if (!string.IsNullOrEmpty(acro)) { // If the acronym is valid, add it to the department list
                deps.AddDepartment(acro, pth);
            }
        }
    }
}
public void ParseDepCourses()
{
    // Loop through all the departments and visit their respective xml files
    foreach (KeyValuePair<string, string> department in deps.DepartmentPaths) {
        try {
            using (XmlReader reader = XmlReader.Create(department.Value)) {
                reader.ReadToFollowing("courses"); // Navigate to the element 'courses'
                while (!reader.EOF) {
                    string pth = reader.GetAttribute("href");
                    string num = reader.GetAttribute("id");
                    reader.Read();
                    if (!string.IsNullOrEmpty(num)) {
                        string crseName = reader.Value; // reader.Value is the element's text, i.e. <elementTag>Value</elementTag>
                        deps[department.Key].Add(new CourseObject(num, crseName, termID, pth)); // Add the course to the department's course list
                    }
                }
            }
        } catch (WebException) { } // A WebException (404) is thrown when no xml file is found, i.e. the department has no courses
    }
}
public void ParseCourseInformation()
{
    // A regular expression that checks each section title to determine whether it is a 'Lecture'
    // section; if so, that section's xml file is visited and its instructor added
    Regex expr = new Regex(@"^\S(L*)\d\b|^\S(L*)\b|^\S\d\b|^\S\b", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace);
    foreach (KeyValuePair<string, Collection<CourseObject>> pair in deps) {
        foreach (CourseObject crse in pair.Value) {
            try {
                using (XmlReader reader = XmlReader.Create(crse.XmlPath)) {
                    reader.ReadToFollowing("creditHours"); // Get the credit hours for the course
                    crse.ParseCreditHours(reader.Value); // Class method that parses the string and grabs the correct integer values
                    reader.ReadToFollowing("sections"); // Navigate to the element 'sections'
                    while (!reader.EOF) {
                        string pth = reader.GetAttribute("href");
                        string crn = reader.GetAttribute("id");
                        reader.Read();
                        if (!string.IsNullOrEmpty(crn)) {
                            string sction = reader.Value;
                            if (expr.IsMatch(sction)) { // Check whether sction is a 'Lecture' section
                                using (XmlReader reader2 = XmlReader.Create(pth)) { // Open its xml file
                                    reader2.ReadToFollowing("instructors"); // Navigate to the element 'instructors'
                                    while (!reader2.EOF) {
                                        string firstName = reader2.GetAttribute("firstName");
                                        string lastName = reader2.GetAttribute("lastName");
                                        reader2.Read();
                                        if (!string.IsNullOrEmpty(firstName) && !string.IsNullOrEmpty(lastName)) { // Make sure it's a valid name
                                            string instr = firstName + ". " + lastName; // Concatenate into a full name
                                            crse.AddSection(pth, sction, crn, instr); // Add the section to the course
                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            } catch (WebException) { } // No course/section information found
        }
    }
}
Although this code takes quite some time to execute (anywhere between 10 and 30 minutes), that is expected given the large amount of data being parsed. Thanks to everyone who posted answers, it was much appreciated. I hope this helps other people who have similar problems/questions.
Thanks,
David
Upvotes: 0
Views: 2084
Reputation: 50336
Well, apparently loading the XML files is somehow slow (e.g. because they are big or because of the time it takes to download them), and using XDocument will load and parse them wholly into memory even if you're only using a small portion of them. Doing this recursively, three levels deep, will make the whole process very slow, but eventually it will end (either as expected or with an OutOfMemoryException).1
Take a look at the XmlReader class. It allows you to read through an XML file sequentially, picking out whatever you need and aborting the read as soon as you have all the information you require. However, it works quite a bit differently from XDocument and is a lot less intuitive. There are some examples on that MSDN page, and on Stack Overflow too.
As a side note: when using XmlReader, consider reading and closing one reader before you start reading another XML file. This keeps the memory footprint of your application minimal.
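To illustrate the sequential style, here is a minimal sketch that pulls attributes out of a stream without materializing the whole document; the element and attribute names (subject, id, href) are modeled on the asker's code, not the real feed, and the XML is inlined as a string for brevity:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class SubjectScanner
{
    // Walk the stream forward element by element, collecting only the
    // attributes we care about; nothing else is kept in memory
    public static List<string> ReadSubjects(TextReader source)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(source))
        {
            while (reader.ReadToFollowing("subject")) // Advance to the next 'subject' element
            {
                string acro = reader.GetAttribute("id");
                string path = reader.GetAttribute("href");
                results.Add(acro + " -> " + path);
            }
        } // Disposing the reader frees its buffers right away
        return results;
    }

    static void Main()
    {
        // A tiny stand-in for one of the university's files
        string xml = "<subjects>" +
                     "<subject id='FIN' href='fin.xml'>Finance</subject>" +
                     "<subject id='MATH' href='math.xml'>Mathematics</subject>" +
                     "</subjects>";
        foreach (string line in SubjectScanner.ReadSubjects(new StringReader(xml)))
            Console.WriteLine(line); // FIN -> fin.xml, then MATH -> math.xml
    }
}
```

Because the reader only moves forward, you can break out of the loop the moment you have what you need, which is exactly what makes it cheaper than parsing the whole file.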
1) Consider, for example, that you're reading files for 10 years of 3 semesters of 60 courses; then your code is downloading, parsing, verifying and processing 10 * 3 * 60 = 1800 files. And it needs to download them over the (compared to your local PC) slow internet. Don't expect this whole process to be quick.
Upvotes: 3
Reputation: 726839
The loop does not become infinite, it simply becomes very, very slow. This is because the call of
XDocument hoursDoc = XDocument.Load(crsePath);
opens another XML file and parses it. Considering that the processing takes 25 seconds when all the information is in memory, it is not surprising that opening an additional file for each course you encounter slows the process to a crawl.
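If some of those paths repeat, one way to soften the cost is to cache each parsed document so any given file is loaded at most once. This is a sketch under that assumption, not the asker's actual code; the class name is made up:

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

class DocumentCache
{
    // Remember every document we have already parsed, keyed by its path
    private readonly Dictionary<string, XDocument> cache =
        new Dictionary<string, XDocument>(StringComparer.OrdinalIgnoreCase);

    public XDocument Load(string path)
    {
        XDocument doc;
        if (!cache.TryGetValue(path, out doc))
        {
            doc = XDocument.Load(path); // Only hit the network/disk on a cache miss
            cache[path] = doc;
        }
        return doc;
    }
}
```

Of course, this only helps when the same file is requested more than once; for files visited exactly once, switching to sequential reading with XmlReader is the bigger win.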
Upvotes: 1