Restricting scrapy to parsing a single tag

Question

I am trying to crawl www.tvtropes.org using Scrapy, for example the page http://tvtropes.org/pmwiki/pmwiki.php/Film/Belle.

I'm invoking the shell to try out the scrape on the above webpage, then getting the relevant section of the page by selecting the div tag that has the attribute itemprop="articleBody". This all works fine.

scrapy shell "http://tvtropes.org/pmwiki/pmwiki.php/Film/Belle"
itembody = response.xpath('//div[@itemprop="articleBody"]')

I then want to extract all the individual list items in that tag, which are the tropes listed for that film. I thought I could do this with:

itembody.xpath('//li')

However, that gives me a huge list of 'li' tags including lots from elsewhere in the page which are not within the 'div' tag I selected. If I want to restrict it to that tag I have to re-state the tag criterion again as follows:

itembody.xpath('//div[@itemprop="articleBody"]//li')

I can do that as a workaround, but I thought that itembody would contain only that tag and not the rest of the page so I'm confused. Can anyone explain this to me?

Thanks in advance.

Wonka · Accepted Answer

Try this XPath:

//div[@itemprop='articleBody']/ul/li

With '/' you get only the direct children of the element.

With '//' you get descendants at any depth (children of children, too).
