kocica

Reputation: 6465

Missed element values when parsing XML file using libxml2

I am parsing specific tags (e.g. titles) from an XML file using libxml2.

Parsing this XML:

<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs1</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs2</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs3</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs4</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs5</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs6</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs7</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs8</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs9</title>
  </entry>
  <entry>
    <title type="html">Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs10</title>
  </entry>
</feed>

Using this C++ code:

void CXMLManager::processNode(xmlTextReaderPtr reader)
{
    static bool root = true;
    std::string name;

    name  = std::string((const char *) xmlTextReaderConstName (reader));

    if (name == "entry")
    {
        if (root)
        {
            m_name = m_title;
            root = false;
            return;
        }

        static bool closeEntry = true;

        if (closeEntry)
        {
            m_feedBuffer.push_back( CFeed { m_name, m_title, m_updated, m_author, m_link } );

            m_title = "";
        }

        closeEntry = !closeEntry;
    }
    else if (name == "title" && xmlTextReaderNodeType(reader) != XML_READER_TYPE_END_ELEMENT)
    {
        m_title = getElementContent(reader);
        std::cout << "Title: " << m_title << std::endl;
    }
}

std::string CXMLManager::getElementContent(xmlTextReaderPtr reader)
{
    xmlNodePtr node = xmlTextReaderCurrentNode(reader);
    xmlChar* text   = xmlNodeGetContent(node);
    return std::string((const char *) text);
}

void CXMLManager::streamFile(const char *data, size_t size)
{
    xmlTextReaderPtr reader;
    int ret;

    /*
     * Pass some special parsing options to activate DTD attribute defaulting,
     * entities substitution and DTD validation
     */
    reader = xmlReaderForMemory(data, size, NULL, NULL,
                XML_PARSE_DTDATTR |  /* default DTD attributes */
                XML_PARSE_NOENT);    /* substitute entities */

    if (reader != NULL)
    {
        ret = xmlTextReaderRead(reader);

        while (ret == 1)
        {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
    }
    else
    {
        throw CFeedreaderException("FEEDREADER: Failed to parse XML.", E_WRONG_XML);
    }
}

In most cases I get the correct result, but every once in a while I get an empty string (even though the title is correct in the XML):

Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs1
Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs2
Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs3
Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs4

Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs6
Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs7
Swedish ISP spanked for sexist 'distracted boyfriend' advert for developer jobs8

I have checked the XML many times before parsing and it is correct, so I don't know what the problem could be. With this input, the 5th string is periodically missing.

Upvotes: 2

Views: 404

Answers (1)

Remy Lebeau

Reputation: 597111

The static local variables are likely throwing off your processing. Remember that a static local variable keeps its value between function invocations. When streamFile() exits and is later called again, your static variables still hold their previous values; they are not reset to their initial values. Change them to members of your CXMLManager class so that streamFile() can reset them each time it is called.
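
To make the static-local behaviour concrete, here is a minimal sketch that does not use libxml2 at all. The class `FlagDemo` and its member names are hypothetical stand-ins for CXMLManager, introduced only for illustration; the sketch models nothing but the "first entry" flag.

```cpp
#include <cassert>

// Hypothetical stand-in for CXMLManager; only the flag handling is modelled.
class FlagDemo
{
public:
    // Mirrors the pattern in the question: one flag shared by every call
    // (and every instance) for the lifetime of the program.
    bool firstEntryStatic()
    {
        static bool root = true;
        const bool wasRoot = root;
        root = false;
        return wasRoot;
    }

    // Member version: the flag belongs to the object and can be reset.
    bool firstEntryMember()
    {
        const bool wasRoot = m_root;
        m_root = false;
        return wasRoot;
    }

    // Called at the start of each parse, as streamFile() would do.
    void beginDocument() { m_root = true; }

private:
    bool m_root = true;
};

// Simulates parsing two documents in a row and checks what each flag reports.
inline void demoTwoDocuments()
{
    FlagDemo parser;

    parser.beginDocument();              // first document
    assert(parser.firstEntryStatic());   // true: first call ever
    assert(parser.firstEntryMember());   // true

    parser.beginDocument();              // second document
    assert(!parser.firstEntryStatic());  // still false: the static kept its value
    assert(parser.firstEntryMember());   // true again after the reset
}
```

Calling `demoTwoDocuments()` from `main()` should complete without tripping an assertion. The point is that the second "document" sees the static flag in whatever state the first one left it, while the member flag starts fresh.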

I don't suggest using a single function to try to handle every possible node you need to parse. I would break up the reading into separate functions that have their own responsibilities at each level of the XML document, something like this:

void CXMLManager::readFeed(xmlTextReaderPtr reader)
{
    // read attributes if needed...

    if (xmlTextReaderIsEmptyElement(reader))
        return;

    int depth = xmlTextReaderNodeDepth(reader);
    int ret;

    while ((ret = xmlTextReaderRead(reader)) == 1)
    {
        switch (xmlTextReaderNodeType(reader))
        {
            case XML_READER_TYPE_ELEMENT:
            {
                if (xmlStrEqual(xmlTextReaderConstLocalName(reader), BAD_CAST "entry"))
                {
                    CFeed entry;
                    readFeedEntry(reader, entry);
                    m_feedBuffer.push_back(entry);
                }
                break;
            }

            case XML_READER_TYPE_END_ELEMENT:
            {
                if ((xmlTextReaderNodeDepth(reader) == depth)
                    /*&& xmlStrEqual(xmlTextReaderConstLocalName(reader), BAD_CAST "feed")*/)
                {
                    return;
                }
                break;
            }
        }
    }

    if (ret == -1)
        throw CFeedreaderException("FEEDREADER: Failed to read XML.", ...);
}

void CXMLManager::readFeedEntry(xmlTextReaderPtr reader, CFeed &entry)
{
    // read attributes if needed...

    if (xmlTextReaderIsEmptyElement(reader))
        return;

    int depth = xmlTextReaderNodeDepth(reader);
    int ret;

    while ((ret = xmlTextReaderRead(reader)) == 1)
    {
        switch (xmlTextReaderNodeType(reader))
        {
            case XML_READER_TYPE_ELEMENT:
            {
                const xmlChar *name = xmlTextReaderConstLocalName(reader);

                if (xmlStrEqual(name, BAD_CAST "title"))
                {
                    readText(reader, entry.m_title/*, BAD_CAST "title"*/);
                    std::cout << "Title: " << entry.m_title << std::endl;
                }
                // else other <entry> children as needed ...

                break;
            }

            case XML_READER_TYPE_END_ELEMENT:
            {
                if ((xmlTextReaderNodeDepth(reader) == depth)
                    /*&& xmlStrEqual(xmlTextReaderConstLocalName(reader), BAD_CAST "entry")*/)
                {
                    return;
                }
                break;
            }
        }
    }

    if (ret == -1)
        throw CFeedreaderException("FEEDREADER: Failed to read XML.", ...);
}

void CXMLManager::readText(xmlTextReaderPtr reader, std::string &text/*, const xmlChar *tagName */)
{
    text.clear();

    if (xmlTextReaderIsEmptyElement(reader))
        return;

    int depth = xmlTextReaderNodeDepth(reader);
    int ret;

    while ((ret = xmlTextReaderRead(reader)) == 1)
    {
        switch (xmlTextReaderNodeType(reader))
        {
            // TODO: handle XML_READER_TYPE_ELEMENT if you need to treat
            // embedded XML elements as part of the text, such as for
            // formatting instructions (like <b>, <i>, etc)...

            case XML_READER_TYPE_TEXT:
            {
                const xmlChar *value = xmlTextReaderConstValue(reader);
                text += reinterpret_cast<const char*>(value);
                break;
            }

            case XML_READER_TYPE_END_ELEMENT:
            {
                if ((xmlTextReaderNodeDepth(reader) == depth)
                    /*&& xmlStrEqual(name, tagName)*/)
                {
                    return;
                }
                break;
            }
        }
    }

    if (ret == -1)
        throw CFeedreaderException("FEEDREADER: Failed to read XML.", ...);
}

void CXMLManager::streamFile(const char *data, size_t size)
{
    /*
     * Pass some special parsing options to activate DTD attribute defaulting
     * and entities substitution
     */
    xmlTextReaderPtr reader = xmlReaderForMemory(data, size, NULL, NULL,
                XML_PARSE_DTDATTR |  /* default DTD attributes */
                XML_PARSE_NOENT);    /* substitute entities */

    if (!reader)
        throw CFeedreaderException("FEEDREADER: Failed to parse XML.", E_WRONG_XML);

    std::unique_ptr<xmlTextReader, decltype(&xmlFreeTextReader)> reader_deleter(reader, xmlFreeTextReader);
    int ret;

    while ((ret = xmlTextReaderRead(reader)) == 1)
    {
        if ((xmlTextReaderNodeType(reader) == XML_READER_TYPE_ELEMENT)
            && xmlStrEqual(xmlTextReaderConstLocalName(reader), BAD_CAST "feed"))
        {
            readFeed(reader);
        }
    }

    if (ret == -1)
        throw CFeedreaderException("FEEDREADER: Failed to read XML.", ...);
}

Alternatively, I would suggest getting rid of the helper functions altogether and doing everything inside streamFile() itself, using a local state machine while looping through the reader, e.g.:

void CXMLManager::streamFile(const char *data, size_t size)
{
    /*
     * Pass some special parsing options to activate DTD attribute defaulting
     * and entities substitution
     */
    xmlTextReaderPtr reader = xmlReaderForMemory(data, size, NULL, NULL,
                XML_PARSE_DTDATTR |  /* default DTD attributes */
                XML_PARSE_NOENT);    /* substitute entities */

    if (!reader)
        throw CFeedreaderException("FEEDREADER: Failed to parse XML.", E_WRONG_XML);

    std::unique_ptr<xmlTextReader, decltype(&xmlFreeTextReader)> reader_deleter(reader, xmlFreeTextReader);

    std::string name, title, updated, author, link, text;
    int feedDepth = -1;
    int entryDepth = -1;
    int textDepth = -1;
    int ret;

    while ((ret = xmlTextReaderRead(reader)) == 1)
    {
        switch (xmlTextReaderNodeType(reader))
        {
            case XML_READER_TYPE_ELEMENT:
            {
                if (textDepth != -1)
                {
                    // TODO: handle this case if you need to treat embedded
                    // XML elements as part of the text, such as for formatting
                    // instructions (like <b>, <i>, etc)...
                    break;
                }

                const xmlChar *tagName = xmlTextReaderConstLocalName(reader);

                if (feedDepth == -1)
                {
                    if (xmlStrEqual(tagName, BAD_CAST "feed"))
                    {
                        // read attributes if needed...

                        feedDepth = xmlTextReaderNodeDepth(reader);
                    }
                }
                else if (entryDepth == -1)
                {
                    if (xmlStrEqual(tagName, BAD_CAST "entry"))
                    {
                        name = title = updated = author = link = text = "";

                        // read attributes if needed...

                        if (xmlTextReaderIsEmptyElement(reader))
                            m_feedBuffer.push_back( CFeed { name, title, updated, author, link } );
                        else
                            entryDepth = xmlTextReaderNodeDepth(reader);
                    }
                }
                else if (xmlStrEqual(tagName, BAD_CAST "title"))
                {
                    text.clear();
                    if (!xmlTextReaderIsEmptyElement(reader))
                        textDepth = xmlTextReaderNodeDepth(reader);
                    else
                        textDepth = -1;
                }
                // else other <entry> children as needed ...

                break;
            }

            case XML_READER_TYPE_TEXT:
            {
                if (textDepth != -1)
                {
                    const xmlChar *value = xmlTextReaderConstValue(reader);
                    text += reinterpret_cast<const char*>(value);
                }

                break;
            }

            case XML_READER_TYPE_END_ELEMENT:
            {
                const xmlChar *tagName = xmlTextReaderConstLocalName(reader);

                if (textDepth != -1)
                {
                    if ((xmlTextReaderNodeDepth(reader) == textDepth)
                        /*&& xmlStrEqual(tagName, BAD_CAST "title")*/)
                    {
                        textDepth = -1;

                        title = text;
                        text.clear();

                        std::cout << "Title: " << title << std::endl;
                    }
                    // else other <entry> children as needed ...
                }
                else if (entryDepth != -1)
                {
                    if ((xmlTextReaderNodeDepth(reader) == entryDepth)
                        /*&& xmlStrEqual(tagName, BAD_CAST "entry")*/)
                    {
                        entryDepth = -1;
                        m_feedBuffer.push_back( CFeed { name, title, updated, author, link } );
                    }
                }
                else if (feedDepth != -1)
                {
                    if ((xmlTextReaderNodeDepth(reader) == feedDepth)
                        /*&& xmlStrEqual(tagName, BAD_CAST "feed")*/)
                    {
                        feedDepth = -1;
                    }
                }

                break;
            }
        }
    }

    if (ret == -1)
        throw CFeedreaderException("FEEDREADER: Failed to read XML.", ...);
}

Upvotes: 2
