How can I keep &ldquo; when I use QDomDocument to parse HTML data?

Question

void test()
    {
        QDomDocument doc("doc");
        QByteArray data = "<p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p>";

        QString sErrorMsg;
        int errLine, errCol;

        if (!doc.setContent(data, &sErrorMsg, &errLine, &errCol)) {
            qDebug() << sErrorMsg;
            qDebug() << errLine << ":" << errCol;
            return;
        }

        QDomNodeList pList = doc.elementsByTagName("p");
        for (int i = 0; i < pList.size(); i++)
        {
            QDomNode p = pList.at(i);
            while (!p.isNull()) {
                QDomElement e = p.toElement(); 
                if (!e.isNull()) {
                    QByteArray ba = e.text().toUtf8(); // Here, the left and right quotation marks are gone.

                }
                p = p.nextSibling();
            }
        }

    }

I'm parsing an HTML phrase that contains &ldquo; and &rdquo;. When the code reaches QByteArray ba = e.text().toUtf8(); the quotation marks are missing from the result.

How do I keep them?

Scheff's Cat · Accepted Answer

I must admit that this is the first time that I used QDomDocument although I already have some experience with XML in general and libXml2 specifically.

First, I can confirm that QDomElement::text() returns text without the typographical quotes encoded by entities.

I modified the MCVE of OP a bit and now, it should be obvious why this happens.

My testQDomDocument.cc:

#include <map>

#include <QtXml>

static const char* toString(QDomNode::NodeType nodeType);

int main(int, char**)
{
  QByteArray text = "<p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p>";
  // setup doc. DOM
  QDomDocument qDomDoc("doc");
  QString qErrorMsg; int errorLine = 0, errorCol = 0;
  if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
    qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
    return 1;
  }
  // inspect DOM
  QDomNodeList qListP = qDomDoc.elementsByTagName("p");
  const int nP = qListP.size();
  qDebug() << "Number of found <p> nodes:" << nP;
  for (int i = 0; i < nP; ++i) {
    const QDomNode qNodeP = qListP.at(i);
    qDebug() << "node <p> #" << i;
    qDebug() << "node.toElement().text(): " << qNodeP.toElement().text();
    for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
      qDebug() << toString(qNode.nodeType());
      switch (qNode.nodeType()) {
        case QDomNode::TextNode:
#if 1 // IMHO, the correct way:
          qDebug() << qNode.toText().data();
#else // works as well:
          qDebug() << qNode.nodeValue();
#endif // 1
          break;
        case QDomNode::EntityReferenceNode:
          qDebug() << qNode.nodeName();
          break;
        default:; // rest of types left out to keep sample short
      }
    }
  }
  // done
  return 0;
}

const char* toString(QDomNode::NodeType nodeType)
{
  static const std::map<QDomNode::NodeType, const char*> mapNodeTypes {
    { QDomNode::ElementNode, "QDomNode::ElementNode" },
    { QDomNode::AttributeNode, "QDomNode::AttributeNode" },
    { QDomNode::TextNode, "QDomNode::TextNode" },
    { QDomNode::CDATASectionNode, "QDomNode::CDATASectionNode" },
    { QDomNode::EntityReferenceNode, "QDomNode::EntityReferenceNode" },
    { QDomNode::EntityNode, "QDomNode::EntityNode" },
    { QDomNode::ProcessingInstructionNode, "QDomNode::ProcessingInstructionNode" },
    { QDomNode::CommentNode, "QDomNode::CommentNode" },
    { QDomNode::DocumentNode, "QDomNode::DocumentNode" },
    { QDomNode::DocumentTypeNode, "QDomNode::DocumentTypeNode" },
    { QDomNode::DocumentFragmentNode, "QDomNode::DocumentFragmentNode" },
    { QDomNode::NotationNode, "QDomNode::NotationNode" },
    { QDomNode::BaseNode, "QDomNode::BaseNode" },
    { QDomNode::CharacterDataNode, "QDomNode::CharacterDataNode" }
  };
  const std::map<QDomNode::NodeType, const char*>::const_iterator iter
    = mapNodeTypes.find(nodeType);
  return iter != mapNodeTypes.end() ? iter->second : "";
}

The Qt project file – testQDomDocument.pro:

SOURCES = testQDomDocument.cc

QT += xml

Build and test:

$ qmake-qt5 testQDomDocument.pro

$ make && ./testQDomDocument
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument.cc
g++ -o testQDomDocument.exe testQDomDocument.o -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text():  "Of course, Jason. My thoughts, exactly."
QDomNode::TextNode
"Of course, "
QDomNode::EntityReferenceNode
"ldquo"
QDomNode::TextNode
"Jason."
QDomNode::EntityReferenceNode
"rdquo"
QDomNode::TextNode
" My thoughts, exactly."

$

To understand what happened, it helps to know that the contents of <p> aren't stored directly in the QDomNode instance for <p>. Instead, the QDomNode instance for <p> (as well as for any other element) has child nodes to store its contents, e.g. a QDomText instance to store a piece of text.

So, QDomElement::text() is a convenience function which returns only the (collected) text and seems to ignore any other nodes. In OP's sample, not all child nodes of the QDomElement for <p> are text nodes.

The entities (&ldquo;, &rdquo;) are stored as QDomEntityReference instances and obviously skipped in QDomElement::text().
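To make the skipped nodes visible in the result, the children could be collected by hand and the entity references mapped to characters. The following is only a sketch on a toy model: Child, resolveEntity() and collectText() are hypothetical names and not part of Qt, and the lookup table covers just the two entities from the question.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the child nodes of an element (NOT Qt's classes):
// a text node carries its text, an entity reference carries the entity name.
enum class ChildType { Text, EntityReference };
using Child = std::pair<ChildType, std::string>;

// Hypothetical lookup: entity name -> UTF-8 bytes of the character.
std::string resolveEntity(const std::string& name)
{
  static const std::map<std::string, std::string> entities {
    { "ldquo", "\xE2\x80\x9C" }, // U+201C left double quotation mark
    { "rdquo", "\xE2\x80\x9D" }  // U+201D right double quotation mark
  };
  const auto iter = entities.find(name);
  // Unknown entities are kept in their "&name;" form instead of being dropped.
  return iter != entities.end() ? iter->second : "&" + name + ";";
}

// Collects the text of all children, resolving entity references on the way.
std::string collectText(const std::vector<Child>& children)
{
  std::string text;
  for (const Child& child : children) {
    if (child.first == ChildType::Text) {
      text += child.second;
    } else {
      text += resolveEntity(child.second);
    }
  }
  return text;
}
```

With QDomDocument, the same loop would presumably run over firstChild() and nextSibling() as in the sample above, appending qNode.toText().data() for text nodes and the mapped character for the qNode.nodeName() of entity reference nodes.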

I must admit I was a bit surprised because (according to my experience in libXml2) I'm used to the fact that entities are resolved into text as well.

The paragraph in QDomEntityReference:

Moreover, the XML processor may completely expand references to entities while building the DOM tree, instead of providing QDomEntityReference objects.

supported the same expectation for QDomDocument.

However, the sample shows that this isn't true in this case.

Thinking twice, I realized that &ldquo; and &rdquo; are not predefined entities in XML.

This is the case in HTML5 (and before) but not in general XML.

The only predefined entities in XML are:

Name | Chr. | Codepoint   | Meaning
-----+------+-------------+-----------------
quot |  "   | U+0022 (34) | quotation mark
amp  |  &   | U+0026 (38) | ampersand
apos |  '   | U+0027 (39) | apostrophe
lt   |  <   | U+003C (60) | less-than sign
gt   |  >   | U+003E (62) | greater-than sign

So, for the replacement of HTML entities, something else is needed in QDomDocument.
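The table of predefined entities translates directly into a lookup. This is a minimal sketch; isPredefinedXmlEntity() is a hypothetical helper name and not a Qt function.

```cpp
#include <set>
#include <string>

// Returns true if the name (without '&' and ';') is one of the
// five entities predefined by XML.
bool isPredefinedXmlEntity(const std::string& name)
{
  static const std::set<std::string> predefined {
    "quot", "amp", "apos", "lt", "gt"
  };
  return predefined.count(name) != 0;
}
```

A call with "ldquo" or "rdquo" returns false, which matches the observation that these names need extra handling when the input is treated as XML.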

Btw. while looking for a hint into this direction, I stumbled into:

SO: QDomDocument fails to set content of an HTML document with tag

I thought a while about how this can be fixed.

I wonder why I didn't immediately think of a very simple fix: replacing the entities with numeric character references (NCRs).

HTML Entity | NCR
------------+----------
&ldquo;     | &#8220;
&rdquo;     | &#8221;

With a slight modification of the above sample:

int main(int, char**)
{
  QByteArray text =
    "<p>Of course, &#8220;Jason.&#8221; My thoughts, exactly.</p>";
  // setup doc. DOM
  QDomDocument qDomDoc("doc");
  QString qErrorMsg; int errorLine = 0, errorCol = 0;
  if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
    qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
    return 1;
  }
  // inspect DOM
  QDomNodeList qListP = qDomDoc.elementsByTagName("p");
  const int nP = qListP.size();
  qDebug() << "Number of found <p> nodes:" << nP;
  for (int i = 0; i < nP; ++i) {
    const QDomNode qNodeP = qListP.at(i);
    qDebug() << "node <p> #" << i;
    qDebug() << "node.toElement().text(): " << qNodeP.toElement().text().toUtf8();
    for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
      qDebug() << toString(qNode.nodeType());
      switch (qNode.nodeType()) {
        case QDomNode::TextNode:
          qDebug() << qNode.toText().data().toUtf8();
          break;
        case QDomNode::EntityReferenceNode:
          qDebug() << qNode.nodeName();
          break;
        default:; // rest of types left out to keep sample short
      }
    }
  }
  // done
  return 0;
}

I got the following output:

$ make && ./testQDomDocument
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument.cc
g++ -o testQDomDocument.exe testQDomDocument.o -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text():  "Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."
QDomNode::TextNode
"Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."

$

Et voilà! Now, there is only one child node in <p> with the complete text including the quotes which are encoded as NCRs.
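If the input cannot be edited by hand, the entities might be replaced programmatically before the text is passed to setContent(). The following is a sketch in plain C++ rather than Qt code: replaceHtmlEntities() is a hypothetical helper, and its table deliberately covers only the two entities from the question.

```cpp
#include <map>
#include <string>

// Replaces the HTML entities listed in the (deliberately tiny) table
// with numeric character references; everything else is left untouched.
std::string replaceHtmlEntities(std::string text)
{
  static const std::map<std::string, std::string> ncrs {
    { "&ldquo;", "&#8220;" },
    { "&rdquo;", "&#8221;" }
  };
  for (const auto& entry : ncrs) {
    std::string::size_type pos = 0;
    while ((pos = text.find(entry.first, pos)) != std::string::npos) {
      text.replace(pos, entry.first.size(), entry.second);
      pos += entry.second.size(); // continue after the inserted NCR
    }
  }
  return text;
}
```

On a QByteArray, QByteArray::replace() should allow the same substitution in place. A complete solution would need a table of all HTML entities that can occur in the input.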

Though, the output of the quotes as \xE2\x80\x9C and \xE2\x80\x9D made me a bit uncertain. (Please, note that I added .toUtf8() to debug output because I got ? and ? before.)

A short check in UTF-8 encoding table and Unicode characters convinced me that these UTF-8 byte sequences are correct.
But why the escaping?
Wrong LANG setting of my bash?

$ ./testQDomDocument 2>&1 | hexdump -C
00000000  4e 75 6d 62 65 72 20 6f  66 20 66 6f 75 6e 64 20  |Number of found |
00000010  3c 70 3e 20 6e 6f 64 65  73 3a 20 31 0a 6e 6f 64  |<p> nodes: 1.nod|
00000020  65 20 3c 70 3e 20 23 20  30 0a 6e 6f 64 65 2e 74  |e <p> # 0.node.t|
00000030  6f 45 6c 65 6d 65 6e 74  28 29 2e 74 65 78 74 28  |oElement().text(|
00000040  29 3a 20 20 22 4f 66 20  63 6f 75 72 73 65 2c 20  |):  "Of course, |
00000050  5c 78 45 32 5c 78 38 30  5c 78 39 43 4a 61 73 6f  |\xE2\x80\x9CJaso|
00000060  6e 2e 5c 78 45 32 5c 78  38 30 5c 78 39 44 20 4d  |n.\xE2\x80\x9D M|
00000070  79 20 74 68 6f 75 67 68  74 73 2c 20 65 78 61 63  |y thoughts, exac|
00000080  74 6c 79 2e 22 0a 51 44  6f 6d 4e 6f 64 65 3a 3a  |tly.".QDomNode::|
00000090  54 65 78 74 4e 6f 64 65  0a 22 4f 66 20 63 6f 75  |TextNode."Of cou|
000000a0  72 73 65 2c 20 5c 78 45  32 5c 78 38 30 5c 78 39  |rse, \xE2\x80\x9|
000000b0  43 4a 61 73 6f 6e 2e 5c  78 45 32 5c 78 38 30 5c  |CJason.\xE2\x80\|
000000c0  78 39 44 20 4d 79 20 74  68 6f 75 67 68 74 73 2c  |x9D My thoughts,|
000000d0  20 65 78 61 63 74 6c 79  2e 22 0a                 | exactly.".|
000000db

$

Aha. That rather seems to be caused by qDebug() which escapes all bytes with values of 128 and above.
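The byte sequences checked above against the UTF-8 table can also be derived from the code points U+201C and U+201D. This is a minimal sketch of the UTF-8 encoding rule; encodeUtf8() is a hypothetical helper that does not validate its input.

```cpp
#include <cstdint>
#include <string>

// Encodes a Unicode code point (up to U+10FFFF) as UTF-8.
// Surrogates and out-of-range values are not checked in this sketch.
std::string encodeUtf8(std::uint32_t cp)
{
  std::string out;
  if (cp < 0x80) {            // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For 0x201C this yields the bytes E2 80 9C and for 0x201D the bytes E2 80 9D, i.e. the sequences that qDebug() printed in escaped form.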
