Reputation: 600
I'm trying to scrape a catalog id number from this page:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='
response = HtmlResponse(url=url)
using the css selector (which works in R with rvest::html_nodes)
".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"
I would like to retrieve the catalog id, which in this case should be:
6011038
An XPath solution is fine too, if that is easier.
Upvotes: 2
Views: 236
Reputation: 1285
There seems to be only one link in the h5 element. So in short:
response.css('h5 > a::attr(href)').re(r'(\d+)$')
Upvotes: 0
Reputation: 10666
If you need to parse the id from the href:
catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
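To see what the `(\d+)$` pattern does on its own, here is a minimal sketch using only the standard `re` module. The href value and the `trailing_id` helper are assumptions for illustration; the real link on the page may be shaped differently.

```python
import re

# Hypothetical href for illustration; not copied from the actual page.
href = "/especies/6011038"


def trailing_id(href):
    """Return the run of digits at the end of href, or None if there is none."""
    match = re.search(r"(\d+)$", href)
    return match.group(1) if match else None


catalog_id = trailing_id(href)
print(catalog_id)
```

The result is a string, so wrap it in `int()` if you need a number. The helper returns `None` when the href does not end in digits.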
Upvotes: 1
Reputation: 141
I don't have scrapy here, but tested this xpath and it will get you the href:
//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
If you're having too much trouble with Scrapy and the CSS selector syntax, I would also suggest trying the BeautifulSoup Python package. With BeautifulSoup you can do things like
link.get('href')
Upvotes: 1