SMUsamaShah

Reputation: 7890

Fastest way to parse XML files in C#?

I have to load many XML files from the internet. To speed up testing, I downloaded all of them (more than 500 files); they all have the following format.

<player-profile>
  <personal-information>
    <id>36</id>
    <fullname>Adam Gilchrist</fullname>
    <majorteam>Australia</majorteam>
    <nickname>Gilchrist</nickname>
    <shortName>A Gilchrist</shortName>
    <dateofbirth>Nov 14, 1971</dateofbirth>
    <battingstyle>Left-hand bat</battingstyle>
    <bowlingstyle>Right-arm offbreak</bowlingstyle>
    <role>Wicket-Keeper</role>
    <teams-played-for>Western Australia, New South Wales, ICC World XI, Deccan Chargers, Australia</teams-played-for>
    <iplteam>Deccan Chargers</iplteam>
  </personal-information>
  <batting-statistics>
    <odi-stats>
      <matchtype>ODI</matchtype>
      <matches>287</matches>
      <innings>279</innings>
      <notouts>11</notouts>
      <runsscored>9619</runsscored>
      <highestscore>172</highestscore>
      <ballstaken>9922</ballstaken>
      <sixes>149</sixes>
      <fours>1000+</fours>
      <ducks>0</ducks>
      <fifties>55</fifties>
      <catches>417</catches>
      <stumpings>55</stumpings>
      <hundreds>16</hundreds>
      <strikerate>96.95</strikerate>
      <average>35.89</average>
    </odi-stats>
    <test-stats>
      .
      .
      .
    </test-stats>
    <t20-stats>
      .
      .
      .    
    </t20-stats>
    <ipl-stats>
      .
      .
      . 
    </ipl-stats>
  </batting-statistics>
  <bowling-statistics>
    <odi-stats>
      <matchtype>ODI</matchtype>
      <matches>378</matches>
      <ballsbowled>58</ballsbowled>
      <runsgiven>64</runsgiven>
      <wickets>3</wickets>
      <fourwicket>0</fourwicket>
      <fivewicket>0</fivewicket>
      <strikerate>19.33</strikerate>
      <economyrate>6.62</economyrate>
      <average>21.33</average>
    </odi-stats>
    <test-stats>
      .
      .
      . 
    </test-stats>
    <t20-stats>
      .
      .
      . 
    </t20-stats>
    <ipl-stats>
      .
      .
      . 
    </ipl-stats>
  </bowling-statistics>
</player-profile>

I am using

XmlNodeList list = _document.SelectNodes("/player-profile/batting-statistics/odi-stats");

And then loop this list with foreach as

foreach (XmlNode stats in list)
{
    _btMatchType = GetInnerString(stats, "matchtype"); // returns a null string if the node is not available
    .
    .
    .
    .
    _btAvg = Convert.ToDouble(stats["average"].InnerText);
}

Even though I am loading all the files offline, parsing is very slow. Is there a faster way to parse them? Or is the problem with SQL? I am saving all the data extracted from the XML to a database using DataSets and TableAdapters with an insert command.

EDIT: For the XmlReader approach, please give some XmlReader code for the above document. So far, I have done this:

void Load(string url) 
{
    _reader = XmlReader.Create(url); 
    while (_reader.Read()) 
    { 
    } 
} 

The available methods of XmlReader are confusing. What I need is to get the batting and bowling stats completely. The batting and bowling stats differ from each other, while the ODI, T20, IPL, etc. blocks have the same structure within batting and within bowling.

Upvotes: 4

Views: 16909

Answers (8)

Robert Rossney

Reputation: 96750

The overhead of throwing exceptions probably dwarfs the overhead of XML parsing. You need to rewrite your code so that it doesn't throw exceptions.

One way is to check for the existence of an element before you ask for its value. That will work, but it's a lot of code. Another way to do it would be to use a map:

Dictionary<string, string> map = new Dictionary<string, string>
{
  { "matchtype", null },
  { "matches", null },
  { "ballsbowled", null }
};

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (map.ContainsKey(elm.Name))
   {
      map[elm.Name] = elm.InnerText;
   }
}

This code will handle all the elements whose names you care about and ignore the ones you don't. If the value in the map is null, it means that an element with that name didn't exist (or had no text).

In fact, if you're putting the data into a DataTable, and the column names in the DataTable are the same as the element names in the XML, you don't even need to build a map, since the DataTable.Columns property is all the map you need. Also, since the DataColumn knows what data type it contains, you don't have to duplicate that knowledge in your code:

foreach (XmlElement elm in stats.SelectNodes("*"))
{
   if (myTable.Columns.Contains(elm.Name))
   {
      DataColumn c = myTable.Columns[elm.Name];
      if (c.DataType == typeof(string))
      {          
         myRow[elm.Name] = elm.InnerText;
         continue;
      }
      if (c.DataType == typeof(double))
      {
         myRow[elm.Name] = Convert.ToDouble(elm.InnerText);
         continue;
      }
      throw new InvalidOperationException("I didn't implement conversion logic for " + c.DataType.ToString() + ".");
   }
}

Note how I'm not declaring any variables to store this information in, so there's no chance of me screwing up and declaring a variable of a data type different from the column it's stored in, or creating a column in my table and forgetting to implement the logic that populates it.

Edit

Okay, here's something that's a bit tricksy. This is a pretty common technique in Python; in C#, I think most people still think there's something weird about it.

If you look at the second example I gave, you can see that it's using the metainformation in the DataColumn to figure out what logic to use for converting an element's value from text to its base type. You can accomplish the same thing by building your own map, e.g.:

Dictionary<string, Type> typeMap = new Dictionary<string, Type>
{
   { "matchtype", typeof(string) },
   { "matches", typeof(int) },
   { "ballsbowled", typeof(int) }
};

and then do pretty much the same thing I showed in the second example:

if (typeMap[elm.Name] == typeof(int))
{
   // XmlElement exposes its text as InnerText; it has no Text property.
   result[elm.Name] = Convert.ToInt32(elm.InnerText);
   continue;
}

Your results can no longer be a Dictionary<string, string>, since now they can contain things that aren't strings; they have to be a Dictionary<string, object>.

But that logic seems a little ungainly; you're testing each item several times, there are continue statements to break out of it - it's not terrible, but it could be more concise. How? By using another map, one that maps types to conversion functions:

Dictionary<Type, Func<string, object>> conversionMap = 
   new Dictionary<Type, Func<string, object>>
{
   { typeof(string), (x => x) },
   { typeof(int), (x => Convert.ToInt32(x)) },
   { typeof(double), (x => Convert.ToDouble(x)) },
   { typeof(DateTime), (x => Convert.ToDateTime(x)) }
};

That's a little hard to read, if you're not used to lambda expressions. The type Func<string, object> specifies a function that takes a string as its argument and returns an object. And that's what the values in that map are: they're lambda expressions, which is to say functions. They take a string argument (x), and they return an object. (How do we know that x is a string? The Func<string, object> tells us.)

This means that converting an element can take one line of code:

result[elm.Name] = conversionMap[typeMap[elm.Name]](elm.InnerText);

Go from the inner to the outer expression: this looks up the element's type in typeMap, then looks up the conversion function for that type in conversionMap, and calls that function, passing it elm.InnerText as an argument.

This may not be the ideal approach in your case. I really don't know. I show it here because there's a bigger issue at play. As Steve McConnell points out in Code Complete, it's easier to debug data than it is to debug code. This technique lets you turn program logic into data. There are cases where using this technique vastly simplifies the structure of your program. It's worth understanding.
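Putting the two maps together, a minimal self-contained sketch might look like the following. The class name StatsConverter, the method name ConvertStats, and the choice of invariant-culture parsing are my assumptions for illustration, not something from the question. It uses TryGetValue so that element names missing from the type map are skipped instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Xml;

public static class StatsConverter
{
    // Element name -> target type. Only names listed here are extracted.
    static readonly Dictionary<string, Type> TypeMap = new Dictionary<string, Type>
    {
        { "matchtype", typeof(string) },
        { "matches", typeof(int) },
        { "average", typeof(double) }
    };

    // Target type -> conversion function. Invariant culture is an assumption,
    // so that "35.89" parses the same way on every machine.
    static readonly Dictionary<Type, Func<string, object>> ConversionMap =
        new Dictionary<Type, Func<string, object>>
    {
        { typeof(string), x => x },
        { typeof(int), x => int.Parse(x, CultureInfo.InvariantCulture) },
        { typeof(double), x => double.Parse(x, CultureInfo.InvariantCulture) }
    };

    // Converts the child elements of one stats node into typed values.
    public static Dictionary<string, object> ConvertStats(XmlNode stats)
    {
        var result = new Dictionary<string, object>();
        foreach (XmlNode child in stats.ChildNodes)
        {
            Type type;
            if (child.NodeType == XmlNodeType.Element &&
                TypeMap.TryGetValue(child.Name, out type))
            {
                result[child.Name] = ConversionMap[type](child.InnerText);
            }
        }
        return result;
    }

    public static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<odi-stats><matchtype>ODI</matchtype><matches>287</matches>" +
                    "<average>35.89</average><ducks>0</ducks></odi-stats>");

        // "ducks" is not in TypeMap, so it is ignored rather than causing an error.
        foreach (var pair in ConvertStats(doc.DocumentElement))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

One thing to watch for in the sample data: a value such as <fours>1000+</fours> is not a valid integer, so a column like that would have to be mapped to typeof(string) (or given its own conversion function) to avoid a FormatException.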

Upvotes: 8

Ron Savage

Reputation: 11079

If you are already converting that information into a DataSet to insert it into tables, just use DataSet.ReadXml() and work with the default tables it creates from the data.
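As a rough sketch of what that looks like, the code below loads a reduced fragment of the question's format and lists whatever tables and columns the DataSet infers. The fragment, the class name ReadXmlSketch, and the helper name Load are assumptions for illustration; with the full file you would pass the file path to ReadXml instead. As far as I know, the inferred data columns all come out as strings, so numeric values still need converting afterwards.

```csharp
using System;
using System.Data;
using System.IO;

public static class ReadXmlSketch
{
    // Loads XML text into a DataSet, letting ReadXml infer tables and columns.
    public static DataSet Load(string xml)
    {
        var ds = new DataSet();
        using (var reader = new StringReader(xml))
        {
            ds.ReadXml(reader);
        }
        return ds;
    }

    public static void Main()
    {
        // Reduced fragment of the question's format (an assumption for brevity).
        const string xml =
            "<player-profile>" +
            "<personal-information><id>36</id><fullname>Adam Gilchrist</fullname></personal-information>" +
            "<batting-statistics><odi-stats><matchtype>ODI</matchtype>" +
            "<matches>287</matches></odi-stats></batting-statistics>" +
            "</player-profile>";

        DataSet ds = Load(xml);

        // Print the inferred schema so you can see what you have to work with.
        foreach (DataTable table in ds.Tables)
        {
            Console.WriteLine(table.TableName + ": " + table.Rows.Count + " row(s)");
            foreach (DataColumn column in table.Columns)
            {
                Console.WriteLine("  " + column.ColumnName + " (" + column.DataType.Name + ")");
            }
        }
    }
}
```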

This toy app does that, and it works with the format you defined above.

Project file: http://www.dot-dash-dot.com/files/wtfxml.zip
Installer: http://www.dot-dash-dot.com/files/WTFXMLSetup_1_8_0.msi

It lets you browse and edit your XML file using a tree and grid format. The tables listed in the grid are the ones automatically created by the DataSet after ReadXml().

Upvotes: 0

Sudesh Sawant

Reputation: 147

An XmlReader is the solution to your problem. An XmlDocument stores a lot of meta-information, which makes the XML easy to access but heavy on memory. I have seen XML files smaller than 50 KB turn into an XmlDocument of a few MB (around 10).

Upvotes: 0

Joshua Muskovitz

Reputation: 362

If you know that the XML is consistent and well formed, you can simply avoid doing real XML parsing and just process the files as flat text. This is risky, non-portable, and brittle.

But it'll be the fastest (to run, not to code) solution.

Upvotes: 3

Chandam

Reputation: 653

You could try LINQ to XML. Or you can use this to figure out what to use.
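For what it's worth, a minimal LINQ to XML sketch for the ODI batting block could look like this. The class names OdiBatting and LinqToXmlSketch and the method ReadOdiBatting are made up for illustration, and the XML is a reduced fragment of the question's format. The explicit casts from XElement to string, int? and double? return null when the element is missing, so no exception is thrown for absent nodes.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical holder for a few of the ODI batting fields.
public class OdiBatting
{
    public string MatchType { get; set; }
    public int? Matches { get; set; }
    public double? Average { get; set; }
    public int? Ducks { get; set; }
}

public static class LinqToXmlSketch
{
    public static OdiBatting ReadOdiBatting(string xml)
    {
        XDocument doc = XDocument.Parse(xml);

        // Elements(...) on a sequence tolerates a missing parent; no null checks needed.
        XElement odi = doc.Root
            .Elements("batting-statistics")
            .Elements("odi-stats")
            .FirstOrDefault();
        if (odi == null)
        {
            return null;
        }

        return new OdiBatting
        {
            // Casting a null XElement to string or a nullable type yields null.
            MatchType = (string)odi.Element("matchtype"),
            Matches = (int?)odi.Element("matches"),
            Average = (double?)odi.Element("average"),
            Ducks = (int?)odi.Element("ducks")
        };
    }

    public static void Main()
    {
        const string xml =
            "<player-profile><batting-statistics><odi-stats>" +
            "<matchtype>ODI</matchtype><matches>287</matches><average>35.89</average>" +
            "</odi-stats></batting-statistics></player-profile>";

        OdiBatting stats = ReadOdiBatting(xml);
        Console.WriteLine(stats.MatchType + " " + stats.Matches + " " + stats.Average);
        Console.WriteLine("ducks present: " + stats.Ducks.HasValue);
    }
}
```

Note that XDocument still loads the whole file into memory, like XmlDocument does, so this is mainly about simpler code rather than a guaranteed speedup.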

Upvotes: 2

djangofan

Reputation: 29669

I wouldn't say LINQ is the best approach. I searched Google and saw some references to the HTML Agility Pack.

I think that if you're going to have a speed bottleneck, it will be in your download process. In other words, it appears that your performance problems are not with your XML code. There may be ways to improve your download speeds or your file I/O, but I don't know what they would be.

Upvotes: 0

Carra

Reputation: 17964

You can use an XmlReader for forward only, fast reading.
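Here is a minimal sketch of how that could look for the batting and bowling blocks in the question. The class name XmlReaderSketch and the helper ReadFields are hypothetical, the XML is a reduced fragment, and the sketch assumes that every child of a *-stats element contains only text. For a file, you would pass the path to XmlReader.Create instead of a StringReader.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class XmlReaderSketch
{
    // Reads the text-only child elements of the element the reader is positioned on.
    public static Dictionary<string, string> ReadFields(XmlReader reader)
    {
        var fields = new Dictionary<string, string>();
        using (XmlReader sub = reader.ReadSubtree())
        {
            sub.Read(); // position on the stats element itself
            sub.Read(); // move to its first child, if any
            while (!sub.EOF)
            {
                if (sub.NodeType == XmlNodeType.Element)
                {
                    string name = sub.Name;
                    // Reads the text and already advances past the end tag,
                    // so no extra Read() here or the next element would be skipped.
                    fields[name] = sub.ReadElementContentAsString();
                }
                else
                {
                    sub.Read(); // skip whitespace, end tags, comments
                }
            }
        }
        return fields;
    }

    public static void Main()
    {
        const string xml =
            "<player-profile>" +
            "<batting-statistics><odi-stats><matchtype>ODI</matchtype>" +
            "<matches>287</matches></odi-stats></batting-statistics>" +
            "<bowling-statistics><odi-stats><matchtype>ODI</matchtype>" +
            "<wickets>3</wickets></odi-stats></bowling-statistics>" +
            "</player-profile>";

        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            string section = null; // "batting-statistics" or "bowling-statistics"
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                {
                    continue;
                }
                if (reader.Name == "batting-statistics" || reader.Name == "bowling-statistics")
                {
                    section = reader.Name;
                }
                else if (section != null && reader.Name.EndsWith("-stats", StringComparison.Ordinal))
                {
                    string statsName = reader.Name;
                    Dictionary<string, string> fields = ReadFields(reader);
                    Console.WriteLine(section + "/" + statsName + ": " + fields.Count + " field(s)");
                }
            }
        }
    }
}
```

Because the same ReadFields helper handles odi-stats, test-stats, t20-stats and ipl-stats, the only thing that differs between batting and bowling is which field names you look up in the returned dictionary.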

Upvotes: 9

Adrian

Reputation: 2364

If the documents are large, then a stream-based parser (which is fine for your needs) will be faster than using XmlDocument, mostly because of the lower overhead. Check out the documentation for XmlReader.

Upvotes: 0
