ivanmacx

Reputation: 13

Restricting scrapy to parsing a single tag

I am trying to crawl www.tvtropes.org using scrapy, for example:

Belle - TV Tropes

I'm invoking the shell to try out the scrape, using the above webpage and then getting the relevant section of the page by selecting the div tag which has the attribute itemprop="articleBody". This all works fine.

scrapy shell "http://tvtropes.org/pmwiki/pmwiki.php/Film/Belle"
itembody = response.xpath('//div[@itemprop="articleBody"]')

I then want to extract all the individual list items in that tag, which are the tropes listed for that film. I thought I could do this with:

itembody.xpath('//li')

However, that gives me a huge list of 'li' tags including lots from elsewhere in the page which are not within the 'div' tag I selected. If I want to restrict it to that tag I have to re-state the tag criterion again as follows:

itembody.xpath('//div[@itemprop="articleBody"]//li')

I can do that as a workaround, but I thought that itembody would contain only that tag and not the rest of the page, so I'm confused. Can anyone explain this to me?

Thanks in advance.

Upvotes: 0

Views: 115

Answers (2)

ivanmacx

Reputation: 13

OK, I promise I searched and searched before asking this question but, of course, I found the answer about 5 minutes after posting.

I need to make the subsequent XPath a relative rather than an absolute reference, i.e.:

itembody.xpath('.//li')

The '.' at the beginning of the XPath makes it search only within the current node, whereas an expression starting with '/' (or '//') is evaluated from the document root, regardless of which selector it is called on. It works just like relative and absolute paths in a file system.

Hopefully this helps someone else.

Upvotes: 1

Wonka

Reputation: 1901

Try this XPath:

//div[@itemprop='articleBody']/ul/li

With '/' you select only the direct children of the element.

With '//' you select descendants at any depth (children, grandchildren, and so on).

Upvotes: 0
