R11G

Reputation: 1980

Multiple whitespaces removed from XML when they should not be

There is a bug in this C++ code: runs of whitespace between words are collapsed into a single space, and I can't figure out where that happens. Whitespace between two words should be preserved as written, not trimmed down to one space. This is the method that deals with whitespace and blanks:

const char* TiXmlBase::SkipWhiteSpace( const char* p, TiXmlEncoding encoding )
{
    if ( !p || !*p )
    {
        return 0;
    }
    if ( encoding == TIXML_ENCODING_UTF8 )
    {
        while ( *p )
        {
            const unsigned char* pU = (const unsigned char*)p;

            if (    *(pU+0)==TIXML_UTF_LEAD_0
                 && *(pU+1)==TIXML_UTF_LEAD_1 
                 && *(pU+2)==TIXML_UTF_LEAD_2 )
            {
                p += 3;
                continue;
            }
            else if(*(pU+0)==TIXML_UTF_LEAD_0
                 && *(pU+1)==0xbfU
                 && *(pU+2)==0xbeU )
            {
                p += 3;
                continue;
            }
            else if(*(pU+0)==TIXML_UTF_LEAD_0
                 && *(pU+1)==0xbfU
                 && *(pU+2)==0xbfU )
            {
                p += 3;
                continue;
            }

            if ( IsWhiteSpace( *p ) )        // Still using old rules for white space.
                p++;
            else
                break;
        }
    }
    else
    {
        while ( *p && IsWhiteSpace( *p ) )
             // while(*p)
            ++p;
    }

    return p;
}

Input:

<?xml version="1.0" standalone="no" ?>
<ToDo>
        <bold>Toy                                           store!</bold>
</ToDo>

Expected output:

<?xml version="1.0" standalone="no" ?>
<ToDo>
        <bold>Toy                                           store!</bold>
</ToDo>

Observed output:

<?xml version="1.0" standalone="no" ?>
<ToDo>
    <bold>Toy store!</bold>
</ToDo>

Upvotes: 3

Views: 2570

Answers (2)

bcdan

Reputation: 1428

Try setting bool TiXmlBase::condenseWhiteSpace to false in the file tinyxml.cpp, or call TiXmlBase::SetCondenseWhiteSpace(false) at runtime, before the document is parsed. The first worked for me.

This probably didn't exist in 2012, but it exists now.
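To make clear what "condensing" means here, below is a minimal, self-contained sketch. It is not TinyXML's code: ReadTextSketch is a hypothetical helper that only mimics the observable effect of the flag, on the assumption that condensing drops leading and trailing whitespace and turns every interior run into one space.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper, NOT part of TinyXML. It imitates the effect of the
// condense flag on element text: verbatim when off, collapsed when on.
std::string ReadTextSketch(const std::string& in, bool condense)
{
    if (!condense)
        return in;                      // keep every character as written

    std::string out;
    bool pendingSpace = false;          // true once a whitespace run was seen
    for (unsigned char c : in)
    {
        if (std::isspace(c))
        {
            pendingSpace = true;        // remember the run, emit nothing yet
            continue;
        }
        if (pendingSpace && !out.empty())
            out += ' ';                 // one space stands in for the whole run
        pendingSpace = false;
        out += static_cast<char>(c);
    }
    return out;                         // a trailing run is never emitted
}
```

With condense set to true, the text from the question's bold element comes back as "Toy store!"; with false it comes back unchanged, which is the behavior the question asks for.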

Upvotes: 0

sehe

Reputation: 392921

Switch to TinyXML-2:

Advantages of TinyXML-2

  • The focus of all future dev.
  • Far fewer memory allocations (1/10th to 1/100th), less memory used (about 40% of TinyXML-1), and faster.
  • No STL requirement.
  • More modern C++, including a proper namespace.
  • Proper and useful handling of whitespace

White Space

Microsoft has an excellent article on white space: http://msdn.microsoft.com/en-us/library/ms256097.aspx

TinyXML-2 preserves white space in a (hopefully) sane way that is almost compliant with the spec. (TinyXML-1 used a completely outdated model.)

As a first step, all newlines / carriage-returns / line-feeds are normalized to a line-feed character, as required by the XML spec.
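The normalization step can be sketched as follows. This is an illustration only, not TinyXML-2's implementation: NormalizeNewlines is a hypothetical helper, and it assumes that both a CR LF pair and a lone CR should become a single LF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT TinyXML-2 code: rewrites "\r\n" and lone "\r"
// as "\n", leaving all other characters untouched.
std::string NormalizeNewlines(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] == '\r')
        {
            out += '\n';
            if (i + 1 < in.size() && in[i + 1] == '\n')
                ++i;                    // swallow the LF of a CR LF pair
        }
        else
        {
            out += in[i];
        }
    }
    return out;
}
```

Only line endings change in this sketch; spaces and tabs inside the text pass through untouched.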

White space in text is preserved. For example:

<element> Hello,  World</element>

The leading space before the "Hello" and the double space after the comma are preserved. Line-feeds are preserved, as in this example:

<element> Hello again,  
          World</element>

However, white space between elements is not preserved. Although not strictly compliant, tracking and reporting inter-element space is awkward, and not normally valuable. TinyXML-2 sees these as the same XML:

<document>
<data>1</data>
<data>2</data>
<data>3</data>
</document>

<document><data>1</data><data>2</data><data>3</data></document>

Upvotes: 5
