Elio Diaz

Reputation: 600

Extracting text from a CSS node with Scrapy

I'm trying to scrape a catalog id number from this page:

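from urllib.request import urlopen

from scrapy.selector import Selector
from scrapy.http import HtmlResponse

url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='

# HtmlResponse does not download anything by itself, so fetch the body first
body = urlopen(url).read()
response = HtmlResponse(url=url, body=body)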

I'm using this CSS selector, which works in R with rvest::html_nodes:

".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"

I would like to retrieve the catalog id, which in this case should be:

6011038

An XPath solution is fine too, if that is easier.

Upvotes: 2

Views: 236

Answers (3)

Thomas Strub

Reputation: 1285

There seems to be only one link in the h5 element. So in short:

response.css('h5 > a::attr(href)').re(r'(\d+)$')

Upvotes: 0

gangabass

Reputation: 10666

If you need to parse the ID from the href:

catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
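To see what the `(\d+)$` pattern does to an href, here is a minimal sketch using only the standard `re` module. The href values are made up for illustration (I'm assuming the link ends in the numeric ID; the real page may differ), and `trailing_id` is a hypothetical helper, not part of Scrapy. For reference, Scrapy's `.re()` returns a list of all matches, while `.re_first()` returns the first match as a string, or None when nothing matches.

```python
import re

# Hypothetical href values, assumed for illustration; the real links may differ.
SAMPLE_HREFS = [
    "/especies/6011038",
    "/especies/6011038-astomiopsis-exserta",
]

# One or more digits anchored at the very end of the string.
TRAILING_DIGITS = re.compile(r"(\d+)$")


def trailing_id(href):
    """Return the digits at the end of href, or None if it does not end in digits."""
    match = TRAILING_DIGITS.search(href)
    return match.group(1) if match else None


for href in SAMPLE_HREFS:
    print(href, "->", trailing_id(href))
```

Note that the `$` anchor means an href with anything after the number (a slug, a query string) will not match, so the pattern may need adjusting if the links look like the second sample.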

Upvotes: 1

Kevin Kamonseki

Reputation: 141

I don't have scrapy here, but tested this xpath and it will get you the href:

//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
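If you also don't have Scrapy at hand, a rough way to check what the path selects is the standard library's `xml.etree.ElementTree`, which supports a small subset of XPath. This is only a sketch under assumptions: the markup below is a made-up, well-formed stand-in for the page (not copied from the site), and since ElementTree has no `contains()`, it uses an exact class match instead. It would likely choke on real-world HTML that is not well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page markup (assumed, not taken from the site).
SNIPPET = """
<div class="result-nombre-container">
  <h5>Nombre</h5>
  <h5><a href="/especies/6011038">Astomiopsis exserta</a></h5>
</div>
"""

# Wrap the snippet so there is a single root element to search under.
root = ET.fromstring("<body>" + SNIPPET + "</body>")

# h5[2] selects the second h5 child of the div; /a then selects its link.
link = root.find(".//div[@class='result-nombre-container']/h5[2]/a")
href = link.get("href") if link is not None else None
print(href)
```

The `h5[2]` step is what picks the second heading inside the container, so if the page puts the link in a different heading the index has to change accordingly.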

If you're having too much trouble with Scrapy and the CSS selector syntax, I would also suggest trying the BeautifulSoup Python package. With BeautifulSoup, once you have found the link element, you can do things like:

link.get('href')

Upvotes: 1
