Why does OpenXML read rows twice

Question

I count rows in two worksheets like this:

foreach (WorksheetPart worksheetPart in workbookPart.WorksheetParts)
{
    OpenXmlPartReader reader = new OpenXmlPartReader(worksheetPart);
    if (count == 0)
    {
        while (reader.Read())
        {
            if (reader.ElementType == typeof(Row))
            {
                count_first++;
            }
        }
    }
    else if (count == 1)
    {
        while (reader.Read())
        {
            if (reader.ElementType == typeof(Row))
            {
                count_second++;
            }
        }
    }
    count++;
}

For both worksheets in count_first and count_second I get twice as much as there are rows with data. Why is that and what does it actually mean? Does it mean that OpenXML parses each list twice?

EDIT

Well, I found a solution. To get it right away, I guess, you should keep this sacred knowledge in some secret place. So, here it is:

while (reader.Read())
{
    if (reader.ElementType == typeof(Row))
    {
        do
        {
            count_first++;
        }   while (reader.ReadNextSibling());
    }
}

petelids · Accepted Answer

The reason you are getting twice the count is due to the way the OpenXmlReader reads each element. The reader treats the open and close nodes as independant items which can be differentiated by checking the IsStartElement and IsEndElement properties.

To demonstrate this you can run something like this:

using (OpenXmlReader reader = OpenXmlReader.Create(worksheetPart))
{
    while (reader.Read())
    {
        if (reader.ElementType == typeof(Row))
        {
            do
            {
                Console.WriteLine("{0} {1} {2}", 
                                  reader.ElementType,
                                  reader.IsStartElement,
                                  reader.IsEndElement);
            } while (reader.Read());

            Console.WriteLine("Finished");
        }
    }
}

Which will produce output along the lines of the following* for a sheet with two rows and two columns (I've highlighted the Rows for readibility):

Row True False
Cell True False
CellValue True False
CellValue False True
Cell False True
Cell True False
CellValue True False
CellValue False True
Cell False True
Row False True
Row True False
Cell True False
CellValue True False
CellValue False True
Cell False True
Cell True False
CellValue True False
CellValue False True
Cell False True
Row False True

There are 2 ways that you can solve this depending on how you want to read the document. The first way (as you point out in your answer) is to move to the next sibling by calling ReadNextSibling - this essentially "jumps" the end element (and any children of the Row). Changing the above example to use ReadNextSibling in the do loop:

do
{
    Console.WriteLine("{0} {1} {2}", 
                       reader.ElementType,
                       reader.IsStartElement,
                       reader.IsEndElement);
} while (reader.ReadNextSibling());

You'll get output* of:

Row True False
Row True False

The second way would be to just count the start elements (or indeed the end elements; just not both):

while (reader.Read())
{
    if (reader.ElementType == typeof(Row) && reader.IsStartElement)
    {
        count_first++;
    }
}

Which one you choose depends on whether you wish to read the Cell values and how you'd like to read them (SAX or DOM).

* In reality each row is prefixed with the namespace of "DocumentFormat.OpenXml.Spreadsheet." which I've removed for readibility.

Why does OpenXML read rows twice

Answers (2)

Related Questions