Reputation:
I am trying to extract several forum posts using a standard XPath expression:
response.xpath('.//div[contains(@class, "Message userContent")]')
That returns the complete list of comments, as expected.
But once I include //text()
or string(...)
the length of the list jumps to 100 or 150 items, which makes it impractical to iterate over the list and join it with other data such as the author or the date.
normalize-space(...)
only returns the first comment.
It has something to do with all the newlines and line breaks in the HTML code, but at this stage I have no idea how to handle them.
Would string-join(...[normalize-space()])
be an option here?
Upvotes: 1
Views: 4303
Reputation: 111501
Realize what each XPath is selecting:
.//div[contains(@class, "Message userContent")]
selects div
elements.
.//div[contains(@class, "Message userContent")]//text()
selects all text node descendants of those div
elements.
normalize-space(.//div[contains(@class, "Message userContent")])
in XPath 1.0 takes the space-normalized string value of the first such div
element.
normalize-space(.//div[contains(@class, "Message userContent")])
in XPath 2.0 is a runtime error when normalize-space()
is passed a sequence of more than one item.
If you want to get the string values of each such div
:
XPath 1.0: Iterate over the div
elements in the hosting language and take the string value of each one separately.
XPath 2.0: Append /string()
to the XPath.
Upvotes: 3
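The XPath 1.0 route (select the div elements, then take each string value in the host language) could look roughly like the sketch below. It is an illustration only. It uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs on its own. The sample markup is invented, and the predicate is an exact class match because ElementTree does not support contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the real forum page.
HTML = """
<html><body>
  <div class="Message userContent">
    First line<br/>
    second line with <b>bold</b> text
  </div>
  <div class="Message userContent">
    Another
    comment
  </div>
</body></html>
""".strip()

root = ET.fromstring(HTML)

# One element per comment, like the div-only XPath in the question.
divs = root.findall(".//div[@class='Message userContent']")

# Flattening to text nodes (the //text() effect) loses the per-comment grouping:
# every fragment between child tags becomes its own list item.
text_nodes = [t for d in divs for t in d.itertext()]

# Take the string value of each div separately and collapse runs of whitespace,
# which approximates applying normalize-space(.) to one element at a time.
comments = [" ".join("".join(d.itertext()).split()) for d in divs]

print(len(divs), len(text_nodes))  # 2 5
print(comments)  # ['First line second line with bold text', 'Another comment']
```

In Scrapy itself, the same idea would presumably be a loop over response.xpath('.//div[contains(@class, "Message userContent")]') that calls div.xpath('normalize-space(.)').get() on each selected div, so that each comment stays one list item and can be paired with its author and date.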