Daniel

Reputation: 460

How to select multiple text parts with Scrapy inside of a tag with subtags?

I have this sample-html:

<div class="classname1">
  "This is text inside of" 
  <b>"a subtag"</b>
  "I would like to select."
  <br>
  "More text I don't need"
  </br>
  
  (more br and b tags on the same level)

</div>
                   

The result should be a list containing:

["This is text inside of a subtag I would like to select."]  

I tried:

response.xpath('//div[@class="classname1"]//text()[1]').getall()

but this returns only the first part, "This is text inside of".

There are two challenges:

  1. Sometimes there is no b tag.
  2. There is more text after the desired section, and it should be excluded.

Maybe a loop? If anyone has an approach it would be really helpful.

Upvotes: 1

Views: 349

Answers (2)

gangabass

Reputation: 10666

What about this (it uses "More text I don't need" as a stop phrase):

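parts = []
# Walk every text node inside the div, in document order
for text in response.xpath('//div[@class="classname1"]//text()').getall():
    # Stop at the first node containing the stop phrase
    # (double quotes, because the phrase itself contains an apostrophe)
    if "More text I don't need" in text:
        break
    parts.append(text)
result = ' '.join(parts)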
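
The text nodes returned by `.getall()` usually carry the surrounding indentation and newlines, so the joined string can look messy. Here is a minimal sketch of the same stop-phrase loop with whitespace cleanup. The helper name `join_text_until` is hypothetical, and the `nodes` list is only my assumption of what `.getall()` might return for the sample HTML (without the literal quotes):

```python
def join_text_until(texts, stop_phrase):
    """Join cleaned text nodes until one contains stop_phrase."""
    parts = []
    for text in texts:
        if stop_phrase in text:
            break
        # Collapse runs of whitespace and drop nodes that are empty afterwards
        cleaned = " ".join(text.split())
        if cleaned:
            parts.append(cleaned)
    return " ".join(parts)


# Assumed shape of the .getall() output for the sample div
nodes = [
    "\n  This is text inside of\n  ",
    "a subtag",
    "\n  I would like to select.\n  ",
    "\n  More text I don't need\n  ",
    "\n",
]

result = join_text_until(nodes, "More text I don't need")
print([result])
```

In a spider you would pass `response.xpath('//div[@class="classname1"]//text()').getall()` instead of `nodes`. Since the loop never looks at tags, it should behave the same when the b tag is missing.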

UPDATE: For example, if you need to extract all text before "Ort: ":

def parse(self, response):
    for card_node in response.xpath('//div[@class="col-md-8 col-sm-12 card-place-container"]'):
        parts = []
        for text in card_node.xpath('.//text()').getall():
            if 'Ort: ' in text:
                break
            parts.append(text)
        before_ort = '\n'.join(parts)
        print(before_ort)

Upvotes: 1

msenior_

Reputation: 2110

Use the descendant-or-self XPath axis in combination with a position predicate, as below:

response.xpath('//div[@class="classname1"]/descendant-or-self::*/text()[position() <3]').getall()
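
To see which nodes this predicate keeps, here is a rough emulation using only the standard library. It is not Scrapy code: `first_text_children` and `select` are hypothetical helpers, and the markup is a simplified, well-formed version of the sample (an assumption, since `xml.etree.ElementTree` needs valid XML). The idea it models is that `text()[position() < 3]` counts text children per parent element, so it keeps the first two text children of each element.

```python
import xml.etree.ElementTree as ET


def first_text_children(element, limit=2):
    """Yield, in document order, the first `limit` text children of every element."""
    position = 0
    if element.text:
        position += 1
        if position <= limit:
            yield element.text
    for child in element:
        # Text nodes inside the child come before the text that follows it
        yield from first_text_children(child, limit)
        if child.tail:
            position += 1
            if position <= limit:
                yield child.tail


def select(xml_source):
    root = ET.fromstring(xml_source)
    # Normalize whitespace and drop whitespace-only nodes after selection
    return [" ".join(t.split()) for t in first_text_children(root) if t.strip()]


with_b = (
    "<div>This is text inside of <b>a subtag</b> I would like to select."
    "<br/>More text I don't need</div>"
)
without_b = "<div>This is text I would like to select.<br/>More text I don't need</div>"

print(select(with_b))
print(select(without_b))
```

Under these assumptions, the version with the b tag yields the three wanted parts, while the version without it also picks up the unwanted text after the br, because that text is then the second text child of the div. So the position-based expression probably only fits pages where the b tag is always present.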

Upvotes: 0
