kevin

Reputation: 2014

scrapy regex returning nonmatching urls as well

My spider code is:

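from scrapy import Field, Item
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TryItem(Item):
    url = Field()

class BbcSpiderSpider(CrawlSpider):
    name = "bbc_spider"
    allowed_domains = ["www.bbc.com"]
    start_urls = ['http://www.bbc.com/sport/0/tennis']

    # Follow article links and pass each matching page to parse_item
    rules = (Rule(LinkExtractor(allow=['.*sport\/0\/tennis\/\d{8}']), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Lowercase name so the local variable does not shadow scrapy's Item class
        item = TryItem()
        item['url'] = response.url
        yield item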

With this spider, I am trying to collect the URLs of all the articles on tennis. I run it with CSV output:

scrapy crawl bbc_spider -o bbc.csv -t csv

The output I am looking for is:

http://www.bbc.com/sport/0/tennis/34322294
http://www.bbc.com/sport/0/tennis/14322295
...
http://www.bbc.com/sport/0/tennis/12345678

But the spider also returns non-matching URLs, such as:

http://www.bbc.com/sport/0/tennis/29604652?print=true
http://www.bbc.com/sport/0/tennis/34252190?comments_page=11&filter=none&initial_page_size=10&sortBy=Created&sortOrder=Descending

Any suggestions? Thanks.

Upvotes: 3

Views: 81

Answers (1)

alecxe

Reputation: 473933

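Your pattern is not anchored at the end, so a URL that has 8 digits followed by a query string still matches. Don't let the spider follow the unwanted URLs: force the URL to end right after the 8 digits: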

.*sport\/0\/tennis\/\d{8}$
#              IMPORTANT ^
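
To see the difference outside of Scrapy, here is a minimal sketch using only Python's `re` module. It assumes the `allow` patterns are applied with search semantics (a match anywhere in the URL counts), which is my understanding of how `LinkExtractor` behaves. The names `UNANCHORED`, `ANCHORED` and `allowed` are made up for this illustration, and the escaped slashes are dropped because `/` needs no escaping in Python regexes:

```python
import re

# Pattern from the question (no end anchor) and the anchored variant.
UNANCHORED = re.compile(r'.*sport/0/tennis/\d{8}')
ANCHORED = re.compile(r'.*sport/0/tennis/\d{8}$')

urls = [
    'http://www.bbc.com/sport/0/tennis/34322294',
    'http://www.bbc.com/sport/0/tennis/29604652?print=true',
]

def allowed(pattern, url):
    # Hypothetical helper: True if the pattern is found anywhere in the URL
    # (assumed to mirror how the allow list is checked).
    return pattern.search(url) is not None

for url in urls:
    print(url, allowed(UNANCHORED, url), allowed(ANCHORED, url))
```

The unanchored pattern accepts both URLs, while the anchored one rejects the `?print=true` variant because `$` requires the string to end immediately after the eighth digit.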

Upvotes: 2
