Curunir The Colorful

Reputation: 71

How can I select all text within an element in scrapy if said element has other elements inside?

I've got a page with sections like this. It is basically a single question within the main p tag, but every time there are certain superscripts, it breaks my code.

The text I want to get is - "For Cosine Rule of any triangle ABC, b2 is equal to"

<p><span class="mcq_srt">MCQ.</span>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>
    <ol>
        <li>a<sup>2</sup> - c<sup>2</sup> + 2ab cos A</li>
        <li>a<sup>3</sup> + c<sup>3</sup> - 3ab cos A</li>
        <li>a<sup>2</sup> + c<sup>2</sup> - 2ac cos B</li>
        <li>a<sup>2</sup> - c<sup>2</sup> 4bc cos A</li>
    </ol>

When I select the p, I lose the 2s that are supposed to be superscripted. I also get two separate strings in the list, which messes things up when I try to store the answers:

 response.css('p::text') > ["For Cosine Rule of any triangle ABC, b", "is equal to"]

I tried a select using

response.css('p sup::text')

and then tried merging the results by checking whether a fragment started with a lowercase letter, but that broke down once I had many questions. Here's what I'm doing in my parse method:

    questions = [x for x in questions if x not in [' ']] #The list I get usually has a bunch of ' ' in them
    question_sup = response.css('p sup::text').extract()
    answer_sup = response.css('li sup::text').extract()
    all_choices = response.css('li::text')[:-2].extract() #for choice
    all_answer = response.css('.dsplyans::text').extract() #for answer

    if len(question_sup) is not 0:
        count=-1
        for question in questions:
            if question[1].isupper() is False or question[0] in [',', '.']: #[1] because there is a space at the starting
                questions[count]+=question_sup.pop(0)+question
                del questions[count+1]

            count+=1

The approach above fails quite often, and since I'm crawling a lot of pages I have no idea how to debug it. I keep getting a "pop from empty list" error, which I guess means something is wrong with the merging logic above. Any help would be much appreciated!

Upvotes: 1

Views: 1443

Answers (1)

lufte

Reputation: 1354

If you select all the elements with text within the p, including the p itself, you will get a list of text nodes that respect the order, so you can simply join the list with ''. Here:

>>> from scrapy.selector import Selector
>>> p = Selector(text='<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>')
>>> t = p.css('p::text, p *::text')  # Give me the text in <p>, plus the text of all of its descendants
>>> ''.join(t.extract())
'For Cosine Rule of any triangle ABC, b2 is equal to'

Of course, you will lose the superscript notation. If you need to preserve it, you could do something like this:

>>> from scrapy.selector import Selector
>>> p = Selector(text='<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>')
>>> t = p.css('p::text, p *')
>>> result = []
>>> for e in t:
...     if isinstance(e.root, str):
...         result.append(e.root)
...     elif e.root.tag == 'sup':
...         result.append('^' + e.root.text)  # Assuming there can't be more nested elements
...     # handle other tags like sub
...
>>> ''.join(result)
'For Cosine Rule of any triangle ABC, b^2 is equal to'

Upvotes: 4
