Reputation: 2014
My spider code is:

    from scrapy import Item, Field
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class TryItem(Item):
        url = Field()

    class BbcSpiderSpider(CrawlSpider):
        name = "bbc_spider"
        allowed_domains = ["www.bbc.com"]
        start_urls = ['http://www.bbc.com/sport/0/tennis']
        rules = (Rule(LinkExtractor(allow=[r'.*sport/0/tennis/\d{8}']), callback='parse_item', follow=True),)

        def parse_item(self, response):
            item = TryItem()
            item['url'] = response.url
            yield item
Through this spider, I am trying to collect the URLs of all the tennis articles. I export to CSV with:

    scrapy crawl bbc_spider -o bbc.csv -t csv
The output I am looking for is:
http://www.bbc.com/sport/0/tennis/34322294
http://www.bbc.com/sport/0/tennis/14322295
...
http://www.bbc.com/sport/0/tennis/12345678
But the spider also returns non-matching URLs, such as:
http://www.bbc.com/sport/0/tennis/29604652?print=true
http://www.bbc.com/sport/0/tennis/34252190?comments_page=11&filter=none&initial_page_size=10&sortBy=Created&sortOrder=Descending
Any suggestions? Thanks.
Upvotes: 3
Views: 81
Reputation: 473933
Don't let the spider follow the unwanted URLs: force the URL to end right after the 8 digits:
.*sport\/0\/tennis\/\d{8}$
# IMPORTANT ^
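To see the effect of the anchor, here is a minimal sketch using Python's `re` module with the two example URLs from the question (assuming the `allow` patterns are applied as regex searches against each extracted URL):

```python
import re

# Unanchored pattern from the question: matches any URL that merely
# contains the article path, even when a query string follows.
loose = re.compile(r'.*sport/0/tennis/\d{8}')
# Anchored pattern from the answer: the URL must end after the 8 digits.
strict = re.compile(r'.*sport/0/tennis/\d{8}$')

article = 'http://www.bbc.com/sport/0/tennis/34322294'
printable = 'http://www.bbc.com/sport/0/tennis/29604652?print=true'

print(bool(loose.search(article)), bool(loose.search(printable)))    # True True
print(bool(strict.search(article)), bool(strict.search(printable)))  # True False
```

With the `$` anchor, the `?print=true` and `?comments_page=...` variants no longer match, so the spider neither follows them nor passes them to `parse_item`.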
Upvotes: 2