Warlord

Reputation: 21

XPATH for Scrapy

So I am using Scrapy to scrape the books off a website.

I have the crawler working and it crawls fine, but when it comes to extracting the data from the HTML with an XPath select, it is not working out right. Since it is a book website, there are almost 131 books on each page, and their XPaths look like this.

For example, getting the title of the books:

1st Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span

The li[] index increases with each book. I am not sure how to turn this into a loop so that it catches all the titles. I have to do this for images and author names too, but I think it will be similar. I just need to get this initial one done.

Thanks for your help in advance.

Upvotes: 0

Views: 2995

Answers (2)

Tasawer Nawaz

Reputation: 935

There are different ways to do this:

  1. The best way to select multiple nodes is on the basis of ids or classes, e.g.:

    sel.xpath("//div[@id='id']")
    
  2. You can select them one at a time (XPath positions are 1-based, so start the range at 1):

    for i in range(1, upto_num_of_divs + 1):
        results = sel.xpath("//li[%d]" % i)
    
  3. Or you can select a whole range of positions in a single expression (no loop needed):

    results = sel.xpath("//li[position() >= 1 and position() <= %d]" % upto_num_of_divs)
    
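The idea behind these approaches can be illustrated with the standard library's `xml.etree.ElementTree` so the sketch is self-contained (in Scrapy you would call `sel.xpath(...)` the same way; the markup below is a simplified stand-in for the question's page, not the real HTML):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the question's page: one <ul> with a <li> per book.
html = """
<div>
  <ul>
    <li><a><span>Book One</span></a></li>
    <li><a><span>Book Two</span></a></li>
    <li><a><span>Book Three</span></a></li>
  </ul>
</div>
"""

root = ET.fromstring(html)

# Selecting every li at once replaces the li[1], li[2], li[3], ... loop:
titles = [span.text for span in root.findall(".//ul/li/a/span")]
print(titles)  # ['Book One', 'Book Two', 'Book Three']
```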

Upvotes: 2

Leo

Reputation: 1046

Here is an example of how you can parse your example HTML:

lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    book_title = li.select('a/span/text()').extract()

Often enough you can do something like //div[@class="final-price"]//span to get the list of all the spans in one XPath expression. The exact expression depends on your HTML; this is just to give you an idea.

Otherwise the code above should do the trick.
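Since the question also asks about images and author names, the per-li loop extends naturally: select each field relative to the li. Here is a sketch using the standard library's ElementTree; the img/em structure below is an assumption for illustration, not taken from the actual page (with Scrapy you would use relative selects like li.select('a/span/text()') on each li):

```python
import xml.etree.ElementTree as ET

# Hypothetical book markup: the <img> and <em> elements are assumptions.
html = """
<ul>
  <li><a href="/b1"><span>Book One</span></a><img src="one.jpg"/><em>Author A</em></li>
  <li><a href="/b2"><span>Book Two</span></a><img src="two.jpg"/><em>Author B</em></li>
</ul>
"""

root = ET.fromstring(html)
books = []
for li in root.findall("li"):
    books.append({
        "title": li.find("a/span").text,     # path relative to this li
        "image": li.find("img").get("src"),  # attribute lookup
        "author": li.find("em").text,
    })
print(books)
```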

Upvotes: 0
