Reputation: 460
I have this sample-html:
<div class="classname1">
"This is text inside of"
<b>"a subtag"</b>
"I would like to select."
<br>
"More text I don't need"
</br>
(more br and b tags on the same level)
</div>
The result should be a list containing:
["This is text inside of a subtag I would like to select."]
I tried:
response.xpath('//div[@class="classname1"]//text()[1]').getall()
but this gives me only the first part "This is text inside".
There are two challenges:
Maybe a loop? If anyone has an approach it would be really helpful.
Upvotes: 1
Views: 349
Reputation: 10666
What about this (I used "More text I don't need" as a stopword):
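parts = []
for text in response.xpath('//div[@class="classname1"]//text()').getall():
    # Stop at the first text node that contains the stopword.
    # Double quotes are needed because the stopword itself contains an apostrophe.
    if "More text I don't need" in text:
        break
    parts.append(text)
result = ' '.join(parts)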
UPDATE: For example, if you need to extract all text before "Ort: ":
def parse(self, response):
    for card_node in response.xpath('//div[@class="col-md-8 col-sm-12 card-place-container"]'):
        parts = []
        for text in card_node.xpath('.//text()').getall():
            if 'Ort: ' in text:
                break
            parts.append(text)
        before_ort = '\n'.join(parts)
        print(before_ort)
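The loops above share one pattern: walk the text nodes in document order and stop at the first one that contains a marker. Here is a minimal sketch of that pattern as a standalone function, so it can be tried without running a spider. The helper name text_before_marker and the sample_nodes list are assumptions for illustration only. The list merely stands in for what .getall() might return for the sample HTML, and the real whitespace may differ. Unlike the loops above, this sketch also collapses whitespace and skips empty nodes.

```python
from typing import Iterable


def text_before_marker(texts: Iterable[str], marker: str, sep: str = " ") -> str:
    """Join text nodes until one contains `marker` (hypothetical helper)."""
    parts = []
    for text in texts:
        if marker in text:
            break  # stop at the first node containing the marker
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if cleaned:  # skip whitespace-only nodes
            parts.append(cleaned)
    return sep.join(parts)


# Stand-in for response.xpath('//div[@class="classname1"]//text()').getall();
# these strings are assumed, not captured from a real response.
sample_nodes = [
    "\nThis is text inside of\n",
    "a subtag",
    "\nI would like to select.\n",
    "\nMore text I don't need\n",
    "\n",
]

result = text_before_marker(sample_nodes, "More text I don't need")
print(result)
```

In a spider you would pass card_node.xpath('.//text()').getall() as the first argument, with 'Ort: ' as the marker and sep='\n' if you want one part per line.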
Upvotes: 1
Reputation: 2110
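Use the descendant-or-self XPath axis in combination with the position() function, as below. Note that the predicate applies per element, so this keeps the first two text nodes of the div and of each of its descendants: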
response.xpath('//div[@class="classname1"]/descendant-or-self::*/text()[position() <3]').getall()
Upvotes: 0