Reputation: 71
I've got a page with sections like this. Each one is basically a single question inside the main p tag, but whenever the text contains superscripts, my code breaks.
The text I want to get is - "For Cosine Rule of any triangle ABC, b2 is equal to"
<p><span class="mcq_srt">MCQ.</span>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>
<ol>
<li>a<sup>2</sup> - c<sup>2</sup> + 2ab cos A</li>
<li>a<sup>3</sup> + c<sup>3</sup> - 3ab cos A</li>
<li>a<sup>2</sup> + c<sup>2</sup> - 2ac cos B</li>
<li>a<sup>2</sup> - c<sup>2</sup> 4bc cos A</li>
</ol>
When I select the p, I lose the 2 that is supposed to be superscripted. I also get two strings in the list instead of one sentence, which messes up a few things when I try to store the answers.
response.css('p::text') > ["For Cosine Rule of any triangle ABC, b", "is equal to"]
I tried a select using
response.css('p sup::text')
and then merging the pieces by checking whether a string started with a lowercase letter, but that broke down when I had many questions. Here's what I'm doing in my parse method:
questions = [x for x in questions if x not in [' ']]  # The list I get usually has a bunch of ' ' in them
question_sup = response.css('p sup::text').extract()
answer_sup = response.css('li sup::text').extract()
all_choices = response.css('li::text')[:-2].extract()  # for choice
all_answer = response.css('.dsplyans::text').extract()  # for answer
if len(question_sup) is not 0:
    count = -1
    for question in questions:
        if question[1].isupper() is False or question[0] in [',', '.']:  # [1] because there is a space at the start
            questions[count] += question_sup.pop(0) + question
            del questions[count+1]
        count += 1
What I tried above fails quite often, and since I'm crawling a lot of pages, I have no idea how to debug it. I keep getting a "pop from empty list" error, which I guess means something is wrong with the merging logic above. Any help would be much appreciated!
Upvotes: 1
Views: 1443
Reputation: 1354
If you select all the text nodes within the p, including those of the p itself, you get a list of text nodes in document order, so you can simply join the list with ''. Here:
>>> from scrapy.selector import Selector
>>> p = Selector(text='<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>')
>>> t = p.css('p::text, p *::text') # Give me the text in <p>, plus the text of all of its descendants
>>> ''.join(t.extract())
'For Cosine Rule of any triangle ABC, b2 is equal to'
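The same joining idea can be applied one element at a time, so that each choice stays a separate string and nothing has to be merged back together afterwards. The following is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors (so it needs well-formed markup), and the names FRAGMENT and texts_per_item are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample markup taken from the question (two of the choices).
FRAGMENT = """
<ol>
<li>a<sup>2</sup> - c<sup>2</sup> + 2ab cos A</li>
<li>a<sup>2</sup> + c<sup>2</sup> - 2ac cos B</li>
</ol>
"""

def texts_per_item(fragment, tag="li"):
    """Return one joined string per <tag> element, in document order."""
    root = ET.fromstring(fragment)
    # itertext() yields an element's own text and the text of all its
    # descendants in document order, so joining keeps "a", "2", " - c" ... together.
    return ["".join(item.itertext()).strip() for item in root.iter(tag)]

choices = texts_per_item(FRAGMENT)
print(choices)  # ['a2 - c2 + 2ab cos A', 'a2 + c2 - 2ac cos B']
```

Because every li is joined on its own, the number of results always matches the number of choices, regardless of how many superscripts each one contains.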
Of course, you will lose the superscript notation. If you need to preserve it, you could do something like this:
>>> from scrapy.selector import Selector
>>> p = Selector(text='<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>')
>>> t = p.css('p::text, p *')
>>> result = []
>>> for e in t:
...     if type(e.root) is str:
...         result.append(e.root)
...     elif e.root.tag == 'sup':
...         result.append('^' + e.root.text)  # Assuming there can't be more nested elements
...     # handle other tags like sub
...
>>> ''.join(result)
'For Cosine Rule of any triangle ABC, b^2 is equal to'
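If nested elements or other tags such as sub can occur, one possible variation is to emit a marker whenever a tag opens and let the text follow in order. This is a rough sketch, not the approach used above: it swaps Scrapy's selectors for the standard library's html.parser, and MARKERS, MarkedText and flatten are hypothetical names chosen for the example.

```python
from html.parser import HTMLParser

# Hypothetical marker table: prefix to emit when one of these tags opens.
MARKERS = {"sup": "^", "sub": "_"}

class MarkedText(HTMLParser):
    """Collect text in document order, prefixing sup/sub content with a marker."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Emit the marker as soon as the tag opens; nested tags just add more markers.
        if tag in MARKERS:
            self.parts.append(MARKERS[tag])

    def handle_data(self, data):
        self.parts.append(data)

def flatten(html):
    """Return the text of an HTML fragment with sup/sub markers inserted."""
    parser = MarkedText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

question = flatten("<p>For Cosine Rule of any triangle ABC, b<sup>2</sup> is equal to</p>")
print(question)  # For Cosine Rule of any triangle ABC, b^2 is equal to
```

Note that the marker only records where a superscript starts, not where it ends, so a multi-character exponent such as 10 would come out as ^10 with no closing delimiter.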
Upvotes: 4