Reputation: 3346
#include <QtXml>

void test()
{
  QDomDocument doc("doc");
  QByteArray data = "<div><p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p></div>";
  QString sErrorMsg;
  int errLine, errCol;
  if (!doc.setContent(data, &sErrorMsg, &errLine, &errCol)) {
    qDebug() << sErrorMsg;
    qDebug() << errLine << ":" << errCol;
    return;
  }
  QDomNodeList pList = doc.elementsByTagName("p");
  for (int i = 0; i < pList.size(); i++)
  {
    QDomNode p = pList.at(i);
    while (!p.isNull()) {
      QDomElement e = p.toElement();
      if (!e.isNull()) {
        QByteArray ba = e.text().toUtf8(); // Here, the left and right quotation marks are gone.
      }
      p = p.nextSibling();
    }
  }
}
I'm parsing an HTML fragment which contains &ldquo; and &rdquo;. When the code reaches QByteArray ba = e.text().toUtf8(); the resulting text no longer contains the quotation marks.
How do I keep them?
Upvotes: 1
Views: 415
Reputation: 3346
QTextDocument text;
text.setHtml("&lt;&gt;&quot;");
qDebug() << text.toPlainText();
I found this approach. At least I don't have to hard-code a replacement for every escaped HTML character.
Upvotes: 0
Reputation: 20141
I must admit that this is the first time that I used QDomDocument although I already have some experience with XML in general and libXml2 specifically.
First, I can confirm that QDomElement::text() returns text without the typographical quotes encoded by entities.
I modified the OP's MCVE a bit, and now it should be obvious why this happens.
My testQDomDocument.cc:
#include <map>
#include <QtXml>

static const char* toString(QDomNode::NodeType nodeType);

int main(int, char**)
{
  QByteArray text = "<div><p>Of course, &ldquo;Jason.&rdquo; My thoughts, exactly.</p></div>";
  // setup doc. DOM
  QDomDocument qDomDoc("doc");
  QString qErrorMsg; int errorLine = 0, errorCol = 0;
  if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
    qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
    return 1;
  }
  // inspect DOM
  QDomNodeList qListP = qDomDoc.elementsByTagName("p");
  const int nP = qListP.size();
  qDebug() << "Number of found <p> nodes:" << nP;
  for (int i = 0; i < nP; ++i) {
    const QDomNode qNodeP = qListP.at(i);
    qDebug() << "node <p> #" << i;
    qDebug() << "node.toElement().text(): " << qNodeP.toElement().text();
    // iterate over the direct child nodes of <p>
    for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
      qDebug() << toString(qNode.nodeType());
      switch (qNode.nodeType()) {
        case QDomNode::TextNode:
#if 1 // IMHO, the correct way:
          qDebug() << qNode.toText().data();
#else // works as well:
          qDebug() << qNode.nodeValue();
#endif // 1
          break;
        case QDomNode::EntityReferenceNode:
          qDebug() << qNode.nodeName();
          break;
        default:; // rest of types left out to keep sample short
      }
    }
  }
  // done
  return 0;
}

// returns a readable name for a node type (for debug output only)
const char* toString(QDomNode::NodeType nodeType)
{
  static const std::map<QDomNode::NodeType, const char*> mapNodeTypes {
    { QDomNode::ElementNode, "QDomNode::ElementNode" },
    { QDomNode::AttributeNode, "QDomNode::AttributeNode" },
    { QDomNode::TextNode, "QDomNode::TextNode" },
    { QDomNode::CDATASectionNode, "QDomNode::CDATASectionNode" },
    { QDomNode::EntityReferenceNode, "QDomNode::EntityReferenceNode" },
    { QDomNode::EntityNode, "QDomNode::EntityNode" },
    { QDomNode::ProcessingInstructionNode, "QDomNode::ProcessingInstructionNode" },
    { QDomNode::CommentNode, "QDomNode::CommentNode" },
    { QDomNode::DocumentNode, "QDomNode::DocumentNode" },
    { QDomNode::DocumentTypeNode, "QDomNode::DocumentTypeNode" },
    { QDomNode::DocumentFragmentNode, "QDomNode::DocumentFragmentNode" },
    { QDomNode::NotationNode, "QDomNode::NotationNode" },
    { QDomNode::BaseNode, "QDomNode::BaseNode" },
    { QDomNode::CharacterDataNode, "QDomNode::CharacterDataNode" }
  };
  const std::map<QDomNode::NodeType, const char*>::const_iterator iter
    = mapNodeTypes.find(nodeType);
  return iter != mapNodeTypes.end() ? iter->second : "<ERROR>";
}
The Qt project file testQDomDocument.pro:
SOURCES = testQDomDocument.cc
QT += xml
Build and test:
$ qmake-qt5 testQDomDocument.pro
$ make && ./testQDomDocument
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument.cc
g++ -o testQDomDocument.exe testQDomDocument.o -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text(): "Of course, Jason. My thoughts, exactly."
QDomNode::TextNode
"Of course, "
QDomNode::EntityReferenceNode
"ldquo"
QDomNode::TextNode
"Jason."
QDomNode::EntityReferenceNode
"rdquo"
QDomNode::TextNode
" My thoughts, exactly."
$
To understand what happened, it helps to know that the contents of <p> aren't stored directly in the QDomNode instance for <p>. Instead, the QDomNode instance for <p> (as for any other element) has child nodes to store its contents, e.g. a QDomText instance to store a piece of text.
So, QDomElement::text() is a convenience function which returns only the (collected) text but seems to ignore any other nodes.
In the OP's sample, not all child nodes of the QDomElement for <p> are text nodes.
The entities (&ldquo;, &rdquo;) are stored as QDomEntityReference instances and are obviously skipped by QDomElement::text().
I must admit I was a bit surprised because, from my experience with libXml2, I'm used to entities being resolved into text as well.
This paragraph in the documentation of QDomEntityReference:
Moreover, the XML processor may completely expand references to entities while building the DOM tree, instead of providing QDomEntityReference objects.
supported the same expectation for QDomDocument.
However, the sample shows that this isn't true in this case.
Thinking twice, I realized that &ldquo; and &rdquo; are not predefined entities in XML.
They are predefined in HTML5 (and earlier HTML versions) but not in general XML.
The only predefined entities in XML are:
Name | Chr. | Codepoint | Meaning
-----+------+-------------+-----------------
quot | " | U+0022 (34) | quotation mark
amp | & | U+0026 (38) | ampersand
apos | ' | U+0027 (39) | apostrophe
lt | < | U+003C (60) | less-than sign
gt | > | U+003E (62) | greater-than sign
So, for the replacement of HTML entities, something else is needed in QDomDocument.
By the way, while looking for a hint in this direction, I stumbled upon:
SO: QDomDocument fails to set content of an HTML document with tag
I thought for a while about how this can be fixed.
I wonder why I didn't immediately think of a very simple fix: replacing the entities with numeric character references (NCRs).
HTML Entity | NCR
------------+----------
&ldquo;     | &#8220;
&rdquo;     | &#8221;
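For input that is not written by hand, the replacement could be done programmatically before the text is passed to QDomDocument::setContent(). The following is only a minimal sketch using the C++ standard library: the function name replaceHtmlEntitiesByNcrs() and the tiny two-entry table are my own assumptions for illustration, and a real table would have to cover every HTML entity which may occur in the input.

```cpp
#include <map>
#include <string>

// Replaces a few named HTML entities by numeric character references (NCRs).
// Names which are not in the table (e.g. the predefined XML entities like
// &amp;) are left untouched.
// Hypothetical helper for illustration; not part of Qt.
std::string replaceHtmlEntitiesByNcrs(const std::string &input)
{
  // deliberately tiny table (assumption); extend as needed
  static const std::map<std::string, std::string> ncrs {
    { "ldquo", "&#8220;" }, // U+201C left double quotation mark
    { "rdquo", "&#8221;" }  // U+201D right double quotation mark
  };
  std::string output;
  std::string::size_type pos = 0;
  while (pos < input.size()) {
    const std::string::size_type amp = input.find('&', pos);
    if (amp == std::string::npos) { // no more candidates: copy the rest
      output.append(input, pos, std::string::npos);
      break;
    }
    output.append(input, pos, amp - pos); // copy the text before '&'
    const std::string::size_type semi = input.find(';', amp);
    if (semi == std::string::npos) { // '&' without ';': copy the rest
      output.append(input, amp, std::string::npos);
      break;
    }
    const auto iter = ncrs.find(input.substr(amp + 1, semi - amp - 1));
    if (iter != ncrs.end()) {
      output += iter->second; // known entity: emit its NCR
      pos = semi + 1;
    } else {
      output += '&'; // unknown: keep '&' and continue scanning after it
      pos = amp + 1;
    }
  }
  return output;
}
```

With Qt, the result could be converted with QByteArray::fromStdString() (available since Qt 5.4) before it is passed to setContent().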
With a slight modification of the first sample (the entities in the input replaced by NCRs):
// includes and toString() as in the first sample

int main(int, char**)
{
  QByteArray text =
    "<div><p>Of course, &#8220;Jason.&#8221; My thoughts, exactly.</p></div>";
  // setup doc. DOM
  QDomDocument qDomDoc("doc");
  QString qErrorMsg; int errorLine = 0, errorCol = 0;
  if (!qDomDoc.setContent(text, &qErrorMsg, &errorLine, &errorCol)) {
    qDebug() << "Line:" << errorLine << "Col.:" << errorCol << qErrorMsg;
    return 1;
  }
  // inspect DOM
  QDomNodeList qListP = qDomDoc.elementsByTagName("p");
  const int nP = qListP.size();
  qDebug() << "Number of found <p> nodes:" << nP;
  for (int i = 0; i < nP; ++i) {
    const QDomNode qNodeP = qListP.at(i);
    qDebug() << "node <p> #" << i;
    qDebug() << "node.toElement().text(): " << qNodeP.toElement().text().toUtf8();
    // iterate over the direct child nodes of <p>
    for (QDomNode qNode = qNodeP.firstChild(); !qNode.isNull(); qNode = qNode.nextSibling()) {
      qDebug() << toString(qNode.nodeType());
      switch (qNode.nodeType()) {
        case QDomNode::TextNode:
          qDebug() << qNode.toText().data().toUtf8();
          break;
        case QDomNode::EntityReferenceNode:
          qDebug() << qNode.nodeName();
          break;
        default:; // rest of types left out to keep sample short
      }
    }
  }
  // done
  return 0;
}
I got the following output:
$ make && ./testQDomDocument
g++ -c -fno-keep-inline-dllexport -D_GNU_SOURCE -pipe -O2 -Wall -W -D_REENTRANT -DQT_NO_DEBUG -DQT_GUI_LIB -DQT_XML_LIB -DQT_CORE_LIB -I. -isystem /usr/include/qt5 -isystem /usr/include/qt5/QtGui -isystem /usr/include/qt5/QtXml -isystem /usr/include/qt5/QtCore -I. -I/usr/lib/qt5/mkspecs/cygwin-g++ -o testQDomDocument.o testQDomDocument.cc
g++ -o testQDomDocument.exe testQDomDocument.o -lQt5Gui -lQt5Xml -lQt5Core -lGL -lpthread
Number of found <p> nodes: 1
node <p> # 0
node.toElement().text(): "Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."
QDomNode::TextNode
"Of course, \xE2\x80\x9CJason.\xE2\x80\x9D My thoughts, exactly."
$
Et voilà! Now, there is only one child node in <p>, with the complete text including the quotes which were encoded as NCRs.
However, the output of the quotes as \xE2\x80\x9C and \xE2\x80\x9D made me a bit uncertain. (Please note that I added .toUtf8() to the debug output because I got ? and ? before.)
A short check in "UTF-8 encoding table and Unicode characters" convinced me that these UTF-8 byte sequences are correct.
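The same check can also be done by hand. The following is a small sketch of the standard UTF-8 encoding rules (the function name encodeUtf8() is my own choice, and validation, e.g. rejecting surrogates, is left out): U+201C lies in the three-byte range, so its 16 bits are distributed over the pattern 1110xxxx 10xxxxxx 10xxxxxx.

```cpp
#include <cstdint>
#include <string>

// Encodes a Unicode code point as UTF-8.
// Sketch only: no validation of surrogates or of values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp)
{
  std::string out;
  if (cp < 0x80) {            // 1 byte:  0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For U+201C: 0x201C >> 12 is 0x2 (first byte 0xE2), the middle six bits are 0 (second byte 0x80), and the lowest six bits are 0x1C (third byte 0x9C). U+201D differs only in the last byte (0x9D).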
But why the escaping?
A wrong LANG setting in my bash?
$ ./testQDomDocument 2>&1 | hexdump -C
00000000 4e 75 6d 62 65 72 20 6f 66 20 66 6f 75 6e 64 20 |Number of found |
00000010 3c 70 3e 20 6e 6f 64 65 73 3a 20 31 0a 6e 6f 64 |<p> nodes: 1.nod|
00000020 65 20 3c 70 3e 20 23 20 30 0a 6e 6f 64 65 2e 74 |e <p> # 0.node.t|
00000030 6f 45 6c 65 6d 65 6e 74 28 29 2e 74 65 78 74 28 |oElement().text(|
00000040 29 3a 20 20 22 4f 66 20 63 6f 75 72 73 65 2c 20 |): "Of course, |
00000050 5c 78 45 32 5c 78 38 30 5c 78 39 43 4a 61 73 6f |\xE2\x80\x9CJaso|
00000060 6e 2e 5c 78 45 32 5c 78 38 30 5c 78 39 44 20 4d |n.\xE2\x80\x9D M|
00000070 79 20 74 68 6f 75 67 68 74 73 2c 20 65 78 61 63 |y thoughts, exac|
00000080 74 6c 79 2e 22 0a 51 44 6f 6d 4e 6f 64 65 3a 3a |tly.".QDomNode::|
00000090 54 65 78 74 4e 6f 64 65 0a 22 4f 66 20 63 6f 75 |TextNode."Of cou|
000000a0 72 73 65 2c 20 5c 78 45 32 5c 78 38 30 5c 78 39 |rse, \xE2\x80\x9|
000000b0 43 4a 61 73 6f 6e 2e 5c 78 45 32 5c 78 38 30 5c |CJason.\xE2\x80\|
000000c0 78 39 44 20 4d 79 20 74 68 6f 75 67 68 74 73 2c |x9D My thoughts,|
000000d0 20 65 78 61 63 74 6c 79 2e 22 0a | exactly.".|
000000db
$
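Aha. The backslash sequences are literally in the output (5c 78 45 32 is \xE2 spelled out in ASCII), so this is not a terminal issue. It rather seems to be caused by qDebug(), which escapes all bytes with values of 128 and above.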
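To illustrate what such escaping does, here is a rough imitation using only the C++ standard library. This is not Qt's implementation, just a sketch which reproduces the observed output under the assumption that every byte of 128 and above becomes \x followed by two upper-case hex digits. The name escapeNonAscii() is made up, and other things qDebug() may escape (e.g. control characters) are ignored.

```cpp
#include <cstdio>
#include <string>

// Escapes every byte >= 128 as \xNN (two upper-case hex digits).
// Hypothetical helper which only mimics the observed debug output.
std::string escapeNonAscii(const std::string &bytes)
{
  std::string out;
  for (const unsigned char byte : bytes) {
    if (byte < 0x80) {
      out += static_cast<char>(byte); // plain ASCII is copied unchanged
    } else {
      char buffer[5]; // backslash, 'x', two hex digits, terminating 0
      std::snprintf(buffer, sizeof buffer, "\\x%02X", static_cast<unsigned>(byte));
      out += buffer;
    }
  }
  return out;
}
```

Applied to the UTF-8 bytes of the left double quotation mark (E2 80 9C), this yields the twelve ASCII characters \xE2\x80\x9C, which matches what the hexdump shows.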
Upvotes: 1