Reputation: 21
So I am using Scrapy to scrape books from a website.
I have the crawler working and it crawls fine, but when it comes to extracting data from the HTML with an XPath select, it is not quite working out. Since it is a book website, there are about 131 books on each page, and their XPaths come out like this
For example getting the title of the books -
1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span
The li[] index increases with each book. I am not sure how to turn this into a loop so that it catches all the titles. I have to do the same for images and author names, but I think it will be similar; I just need to get this first one done.
Thanks for your help in advance.
Upvotes: 0
Views: 2995
Reputation: 935
There are different ways to do this.
The best way to select multiple nodes is on the basis of an id or class, e.g.:
sel.xpath("//div[@id='id']")
Or you can loop over positions (note that XPath positions start at 1, not 0):
for i in range(1, upto_num_of_divs + 1):
    divs = sel.xpath("//div[%d]" % i)
Or you can select a whole range of positions in a single expression:
divs = sel.xpath("//div[position() >= 1 and position() <= %d]" % upto_num_of_divs)
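In most cases you don't need a position loop at all: one relative XPath without an index matches every node at that level. A minimal sketch of the idea, using lxml directly instead of Scrapy's selector and a made-up book-list snippet (the markup here is illustrative, not the asker's actual page):

```python
from lxml import html

# Hypothetical markup mirroring the ul/li structure from the question.
SNIPPET = """
<div><ul>
  <li><a href="/b1"><span>Book One</span></a></li>
  <li><a href="/b2"><span>Book Two</span></a></li>
  <li><a href="/b3"><span>Book Three</span></a></li>
</ul></div>
"""

tree = html.fromstring(SNIPPET)

# Leaving out the [n] index makes one query match every <li>,
# so no loop over positions is needed.
titles = tree.xpath('//div/ul/li/a/span/text()')
print(titles)  # ['Book One', 'Book Two', 'Book Three']
```

The same `//div/ul/li/a/span/text()` expression works unchanged in Scrapy's `sel.xpath(...)`.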
Upvotes: 2
Reputation: 1046
Here is an example of how you can parse your example HTML:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    book_el = li.select('a/span/text()')
Often enough you can do something like //div[@class="final-price"]//span
to get the list of all the spans in one XPath. The exact expression depends on your HTML; this is just to give you an idea.
Otherwise, the code above should do the trick.
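Since you also need images and author names, iterating over each li and using relative XPaths keeps the fields of one book together. A sketch of that pattern with lxml; the img element and the author span's class name here are assumptions for illustration, not taken from your actual page:

```python
from lxml import html

# Hypothetical markup: the <img> for the cover and the
# <span class="author"> are assumed, not from the real site.
SNIPPET = """
<div><ul>
  <li><a href="/b1"><img src="/covers/1.jpg"/><span>Book One</span></a>
      <span class="author">Alice</span></li>
  <li><a href="/b2"><img src="/covers/2.jpg"/><span>Book Two</span></a>
      <span class="author">Bob</span></li>
</ul></div>
"""

tree = html.fromstring(SNIPPET)
books = []
for li in tree.xpath('//div/ul/li'):
    # Relative XPaths (no leading //) stay inside the current <li>,
    # so title, image and author from the same book stay together.
    books.append({
        'title': li.xpath('a/span/text()')[0],
        'image': li.xpath('a/img/@src')[0],
        'author': li.xpath('span[@class="author"]/text()')[0],
    })
print(books)
```

In a Scrapy callback the loop body is the same, just with the selector's `xpath(...).extract()` calls and a yielded item instead of a list append.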
Upvotes: 0