Reputation: 95
I'm here to ask for some help with QXmlQuery and XPath. I'm trying to use this combination to extract some data from several HTML documents. These documents are downloaded and then cleaned with the HTML Tidy library.
The problem arises when I try my XPath. Here is some example code:
[...]
<ul class="bullet" id="idTab2">
<li><span>Hauteur :</span> 1127 mm</li>
<li><span>Largeur :</span> 640 mm</li>
<li><span>Profondeur :</span> 685 mm</li>
<li><span>Poids :</span> 159.6 kg</li>
[...]
The cleaned code is stored in a QString named "code":
QStringList fields, values;
QXmlQuery query;
query.setFocus(code);
query.setQuery("//*[@id=\"idTab2\"]/*/*/string()");
query.evaluateTo(&fields);
My goal is to get all the fields (Hauteur, Largeur, Profondeur, Poids, etc.) and their values (1127 mm, 640 mm, 685 mm, 159.6 kg, etc.).
Question 1
As you can see, I use the XPath //*[@id="idTab2"]/*/*/string() to recover the fields, because //ul[@id="idTab2"]/li/span/string() doesn't work. When I specify a tag name, it returns nothing; it only works with *. Why? I've checked the code returned by the tidy function, and the path to these elements is not altered, so I don't see any problem. Is this normal? Or maybe there is something I don't know...
Question 2
In the previous XHTML code, each li tag wraps a span tag and some text. I don't know how to get only the text and not the content of the span tag. I tried:
//*[@id="idTab2"]/*/string()
gives: Hauteur : 1127 mm Largeur : 640 mm Profondeur : 685 mm
//*[@id="idTab2"]/*[2]/string()
gives: nothing
So, if I'm not wrong, the text in the li tag is not treated as a child node, but it should be. See the accepted answer to "Select just text directly in node, not in child nodes".
Thanks for reading, I hope someone can help me.
Upvotes: 2
Views: 1218
Reputation: 20748
To select the individual <li> elements (not their text representation), you can test the text content of their <span>:
//*[@id="idTab2"]/li[starts-with(span, "Hauteur")]
The same works for the other items:
//*[@id="idTab2"]/li[starts-with(span, "Largeur")]
//*[@id="idTab2"]/li[starts-with(span, "Profondeur")]
//*[@id="idTab2"]/li[starts-with(span, "Poids")]
To get the string representation of one of these <li> elements, you can wrap the whole expression in string(), like this:
string(//*[@id="idTab2"]/li[starts-with(span, "Poids")])
which gives "Poids : 159.6 kg"
To extract only the text node in the <li>, without the <span>, you can use the expressions below. They select the text nodes that are direct children of the <li> (the <span> is an element, not a text node) and remove the leading and trailing whitespace characters with normalize-space():
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Hauteur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Largeur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Profondeur")]/text())
normalize-space(//*[@id="idTab2"]/li[starts-with(span, "Poids")]/text())
The last one gives "159.6 kg".
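Since the expressions above differ only in the label, they could be generated rather than typed out one by one. The following is a minimal sketch using only the C++ standard library; itemPath and valueQuery are hypothetical helper names (they are not part of Qt), and the sketch assumes the label contains no double quote character. Raw string literals avoid having to escape the quotes inside the XPath.

```cpp
#include <string>

// Hypothetical helper: XPath selecting the <li> under the element with
// id "idTab2" whose <span> text starts with `label`.
// Assumption: `label` contains no double quote character.
std::string itemPath(const std::string& label) {
    return R"xp(//*[@id="idTab2"]/li[starts-with(span, ")xp" + label + R"xp(")])xp";
}

// Hypothetical helper: expression returning only the text that is a direct
// child of that <li>, with leading and trailing whitespace removed.
std::string valueQuery(const std::string& label) {
    return "normalize-space(" + itemPath(label) + "/text())";
}
```

The resulting string could then be handed to the query from the question, for example with query.setQuery(QString::fromStdString(valueQuery("Poids"))), followed by evaluateTo() into a QString.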
Upvotes: 1