Reputation: 21
So I am using Scrapy to scrape books from a website.
I have the crawler working and it crawls fine, but when it comes to extracting data from the HTML with an XPath select, it is not quite working out. Since it is a book website, there are about 131 books on each page, and their XPaths come out like this
For example getting the title of the books -
1st Book --- > /html/body/div/div[3]/div/div/div[2]/div/ul/li/a/span
2nd Book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[2]/a/span
3rd book ---> /html/body/div/div[3]/div/div/div[2]/div/ul/li[3]/a/span
The li[] index increases with each book. I am not sure how to turn this into a loop so that it catches all the titles. I have to do the same for images and author names, but I think it will be similar; I just need to get this first one done.
Thanks for your help in advance.
Upvotes: 0
Views: 2995
Reputation: 935
There are different ways to do this.
The best way to select multiple nodes is on the basis of an id or class, e.g.:
sel.xpath("//div[@id='id']")
Or you can loop over positions (note that XPath positions start at 1, not 0):
for i in range(1, upto_num_of_divs + 1):
    divs = sel.xpath("//div[%d]" % i)
Or you can select a whole range of positions in a single expression:
divs = sel.xpath("//div[position() >= 1 and position() <= %d]" % upto_num_of_divs)
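In most cases you don't need a position loop at all: one relative XPath without an index matches every node at that level. A minimal sketch of the idea, using lxml directly instead of Scrapy's selector and a made-up book-list snippet (the markup here is illustrative, not the asker's actual page):

```python
from lxml import html

# Hypothetical markup mirroring the ul/li structure from the question.
SNIPPET = """
<div><ul>
  <li><a href="/b1"><span>Book One</span></a></li>
  <li><a href="/b2"><span>Book Two</span></a></li>
  <li><a href="/b3"><span>Book Three</span></a></li>
</ul></div>
"""

tree = html.fromstring(SNIPPET)

# Leaving out the [n] index makes one query match every <li>,
# so no loop over positions is needed.
titles = tree.xpath('//div/ul/li/a/span/text()')
print(titles)  # ['Book One', 'Book Two', 'Book Three']
```

The same `//div/ul/li/a/span/text()` expression works unchanged in Scrapy's `sel.xpath(...)`.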
Upvotes: 2
Reputation: 1046
Here is an example of how you can parse your example HTML:
lis = hxs.select('//div/div[3]/div/div/div[2]/div/ul/li')
for li in lis:
    book_el = li.select('a/span/text()')
Often enough you can do something like //div[@class="final-price"]//span
to get the list of all the spans in one XPath. The exact expression depends on your HTML; this is just to give you an idea.
Otherwise, the code above should do the trick.
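Since you also need images and author names, iterating over each li and using relative XPaths keeps the fields of one book together. A sketch of that pattern with lxml; the img element and the author span's class name here are assumptions for illustration, not taken from your actual page:

```python
from lxml import html

# Hypothetical markup: the <img> for the cover and the
# <span class="author"> are assumed, not from the real site.
SNIPPET = """
<div><ul>
  <li><a href="/b1"><img src="/covers/1.jpg"/><span>Book One</span></a>
      <span class="author">Alice</span></li>
  <li><a href="/b2"><img src="/covers/2.jpg"/><span>Book Two</span></a>
      <span class="author">Bob</span></li>
</ul></div>
"""

tree = html.fromstring(SNIPPET)
books = []
for li in tree.xpath('//div/ul/li'):
    # Relative XPaths (no leading //) stay inside the current <li>,
    # so title, image and author from the same book stay together.
    books.append({
        'title': li.xpath('a/span/text()')[0],
        'image': li.xpath('a/img/@src')[0],
        'author': li.xpath('span[@class="author"]/text()')[0],
    })
print(books)
```

In a Scrapy callback the loop body is the same, just with the selector's `xpath(...).extract()` calls and a yielded item instead of a list append.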
Upvotes: 0