Jshee

Reputation: 2686

Match html output for result scrapy (skip first match)

I have existing Scrapy code, but I'm having trouble formulating a NEXT_PAGE_SELECTOR that selects the next-page link with a CSS selector:

def parse(self, response):
    """Get the first page of results."""
    # Class selector (leading dot); assumes each result element has class "b_algo".
    SET_SELECTOR = '.b_algo'
    for bresult in response.css(SET_SELECTOR):
        NAME_SELECTOR = 'h2 a ::text'
        yield {
            'name': bresult.css(NAME_SELECTOR).extract_first(),
        }

    # Get the further pages of results.
    #<<NEXT_PAGE_SELECTOR here>>

The HTML I'm trying to match is:

<ul class="sb_pagF" aria-label="More pages with results">
<li>
          <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
          </a>
</li>
</ul>

I've formulated the following to match this:

NEXT_PAGE_SELECTOR = '.sb_pagF li a ::attr(href)'

Does this look right to grab the href?

Thanks!

Upvotes: 0

Views: 322

Answers (2)

Granitosaurus

Reputation: 21436

Yes, it is correct:

$ scrapy shell
In [1]: foo = """<ul class="sb_pagF" aria-label="More pages with results">
<li>
          <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
          </a>
</li>
</ul>"""
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=foo)
In [4]: sel.css('.sb_pagF li a ::attr(href)').extract()
Out[4]: [u'/search?q=site%3asite.com&first=11&FORM=PORE']
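
Note that the extracted value contains `&` where the HTML source has `&amp;`: entities in attribute values are decoded during parsing. As a minimal sketch of that decoding on its own, using only the standard library rather than Scrapy (the variable names are illustrative):

```python
from html import unescape

# The href exactly as written in the HTML source, entities included.
raw_href = "/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE"

# Decoding the entities yields the same string the selector returned above.
decoded_href = unescape(raw_href)
print(decoded_href)
```

The percent-encoded `%3a` is left alone, since it is URL encoding rather than an HTML entity.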

Upvotes: 3

alecxe

Reputation: 473893

You can always test your selectors in the Scrapy shell by pointing it at a local HTML file:

$ cat index.html
<ul class="sb_pagF" aria-label="More pages with results">
    <li>
        <a title="Next page" class="sb_pagN" href="/search?q=site%3asite.com&amp;first=11&amp;FORM=PORE">
            <div class="sw_next">Next
            </div>
        </a>
    </li>
</ul>
$ scrapy shell file://$PWD/index.html
In [1]: response.css(".sb_pagF li a ::attr(href)").extract_first()
Out[1]: u'/search?q=site%3asite.com&first=11&FORM=PORE'
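
The extracted href is relative, so it has to be resolved against the URL of the current page before it can be requested. The sketch below shows that resolution with the standard library's `urljoin`. The `page_url` value is a hypothetical placeholder, not something taken from the question.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page that was just scraped (an assumption for illustration).
page_url = "https://example.com/search?q=site%3asite.com"

# Relative href returned by the '.sb_pagF li a ::attr(href)' selector.
next_href = "/search?q=site%3asite.com&first=11&FORM=PORE"

# An href starting with "/" replaces the path and query of the base URL.
next_url = urljoin(page_url, next_href)
print(next_url)
```

Inside a spider, `response.urljoin(next_href)` performs the same join against `response.url`, and the result can be passed to `scrapy.Request(next_url, callback=self.parse)`. Since `extract_first()` returns `None` when there is no next-page link, it is probably worth checking for that before building the request.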

Upvotes: 3
