Scrapy: grabbing sibling elements of a regex match

Question

I'm using Scrapy to scrape college essay topics from college websites. I know how to match a keyword using a regular expression, but the information that I really want is the other elements in the same div as the match. The Response.css(...).re(...) function in Scrapy returns a string. Is there any way to navigate to the parent div of the regex match?

Example: https://admissions.utexas.edu/apply/freshman-admission#fndtn-freshman-admission-essay-topics. On the above page, I can match the essay topics h1 using: response.css("*::text").re("Essay Topics"). However, I can't figure out a way to grab the 2 actual essay topics in the same div under Topic A and Topic N.

Tarun Lalwani · Accepted Answer

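That's not the right way to do it. Calling .re() on a selector returns plain strings, so there is nothing left to navigate from. Select the elements you want directly instead, for example with XPath: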

response.xpath("//div[@id='freshman-admission-essay-topics']//h5//text()").extract()

If you prefer CSS selectors, you can use:

In [7]: response.css("#freshman-admission-essay-topics h5::text, #freshman-admission-essay-topics h5 span::text").extract()
Out[7]: ['Topic A \xa0\xa0', 'Topic N']
