Reputation: 31
I have programmed a spider in scrapy to extract data from a website. I have a list of links with similar-structured html tables and the extraction of those works fine so far. Now the problem is that some of these tables run over multiple pages, i.e. if a data set is longer than 30 rows, it's split up. Most tables only have the one page, but some are considerably longer with up to 70 following pages. The next page is reached through pressing a "next sheet" button in the form of an html form. I want the spider to go to each of the tables, extract the data from the first page, then proceed to the second, third page etc. until there is no "next" button anymore, then continue with the next of the original urls.
I understand that what I need is probably the FormRequest command, but I am new to these things and all the examples I have found on the web were structured slightly differently, so help would be greatly appreciated.
This is my code which extracts the first page of each table.
from scrapy.spiders import BaseSpider
from scrapy.selector import HtmlXPathSelector
from example.items import exitem
from scrapy.http import FormRequest

class MySpider(BaseSpider):
    name = "example"
    with open('linklist.txt') as f:
        start_urls = f.readlines()

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        main = hxs.xpath("/html/body/table[2]/tr/td[2]/table/tr/td/table[1]/tr[1]/td[1]/table")
        titles = hxs.xpath("/html/body/table[2]/tr/td[2]/table/tr/td/table[1]/tr[2]/td/table/tr")
        items = []
        for title in titles:
            item = exitem()
            item["pid"] = title.xpath("td[2]/font/text()").extract()
            item["famname"] = title.xpath("td[3]/font/b/text()").extract()
            item["firstname"] = title.xpath("td[4]/font/text()").extract()
            item["sex"] = title.xpath("td[5]/font/text()").extract()
            item["age"] = title.xpath("td[6]/font/text()").extract()
            item["famstat"] = title.xpath("td[7]/font/text()").extract()
            item["res"] = title.xpath("td[8]/font/text()").extract()
            item["nation"] = title.xpath("td[9]/font/text()").extract()
            item["state"] = title.xpath("td[10]/font/text()").extract()
            item["job"] = title.xpath("td[11]/font/text()").extract()
            items.append(item)
        return items
This is the form on the website:
<form action="http://example.com/listen.php" method="get">
    <input type="submit" value="next sheet" name="">
    <input type="hidden" value="1234567" name="ArchivIdent">
    <input type="hidden" value="31" name="start">
</form>
The "start" value is 31 for the second page, 61 for the third page, 91 for the fourth etc.
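Since the form submits with `method="get"`, pressing "next sheet" just loads a URL built from the hidden fields, so the spider can read `ArchivIdent` and `start` out of the form and request that URL itself (a plain `Request` works; `FormRequest` is not strictly required). A minimal sketch of the URL construction, assuming nothing beyond the form shown above (the helper name is mine):

```python
from urllib.parse import urlencode

def next_page_url(base_url, archiv_ident, start):
    # The hidden "start" field already holds the value for the *next*
    # page (31 on page 1, 61 on page 2, ...), so pass it through as-is.
    params = {"ArchivIdent": archiv_ident, "start": start}
    return base_url + "?" + urlencode(params)

# The form above would produce:
# http://example.com/listen.php?ArchivIdent=1234567&start=31
print(next_page_url("http://example.com/listen.php", "1234567", "31"))
```

In `parse()` one would then yield `Request(next_page_url(...), callback=self.parse)` whenever the form is present on the page, and simply fall through to the next of the original URLs once it disappears.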
Upvotes: 2
Views: 1503
Reputation: 100
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Firefox()
driver.get("your page")
try:
    driver.find_element(By.XPATH, "//*[@type='submit'][@value='next sheet']").click()
except NoSuchElementException:
    pass
# continue with your program

Once the button is not found, the `except` branch is reached and you can continue with the rest of your program.
Hope this helps.
Upvotes: 1