Scrapy: multiple "start_urls" yield duplicated results

Question

Although my simple code seems correct according to the official documentation, it generates unexpectedly duplicated results, such as:

9 rows/results when setting 3 URLs
4 rows/results when setting 2 URLs

When I set just 1 URL, my code works fine. I also tried the answer to this SO question, but it didn't solve my issue.

[Scrapy command]

$ scrapy crawl test -o test.csv

[Scrapy spider: test.py]

import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        for url in self.start_urls:
            table_rows = response.xpath('//table/tbody/tr')

            for table_row in table_rows:
                item = TestItem()
                item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
                item['test_02'] = table_row.xpath('td[2]/text()').extract_first()

                yield item

[Target HTML: test1.html, test2.html, test3.html]

[Generated CSV results for 3 URLs]

test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1
test2 A1,test2 B1
test3 A1,test3 B1
test3 A1,test3 B1
test3 A1,test3 B1

[Expected results for 3 URLs]

test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1
test3 A1,test3 B1

[Generated CSV results for 2 URLs]

test_01,test_02
test1 A1,test1 B1
test1 A1,test1 B1
test2 A1,test2 B1
test2 A1,test2 B1

[Expected results for 2 URLs]

test_01,test_02
test1 A1,test1 B1
test2 A1,test2 B1

Guillaume · Accepted Answer

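You are iterating over start_urls again inside parse. You don't need to do that: Scrapy already requests every URL in start_urls and calls parse once per response, so you end up looping over start_urls twice. With N URLs, each page's rows are yielded N times, which is why 3 URLs give 9 rows and 2 URLs give 4.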

Try this instead:

import scrapy
from ..items import TestItem

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = [
        'file:///Users/Name/Desktop/tutorial/test1.html',
        'file:///Users/Name/Desktop/tutorial/test2.html',
        'file:///Users/Name/Desktop/tutorial/test3.html',
    ]

    def parse(self, response):
        table_rows = response.xpath('//table/tbody/tr')

        for table_row in table_rows:
            item = TestItem()
            item['test_01'] = table_row.xpath('td[1]/text()').extract_first()
            item['test_02'] = table_row.xpath('td[2]/text()').extract_first()

            yield item
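
To see where the 9 rows come from without running Scrapy, here is a rough simulation in plain Python. It is only a sketch: crawl, parse_buggy, parse_fixed, START_URLS and PAGES are made-up names, and it assumes each test page holds a single table row, as the CSV output in the question suggests. crawl stands in for Scrapy calling the callback once per start URL.

```python
# Stand-in for Scrapy's scheduling: call parse() once per start URL.
# All names here are hypothetical; no Scrapy import is needed.
START_URLS = ["test1.html", "test2.html", "test3.html"]

# Assumption: each page has a single table row.
PAGES = {url: [(f"{url[:-5]} A1", f"{url[:-5]} B1")] for url in START_URLS}


def parse_buggy(url):
    # Mirrors the question: an extra loop over every start URL,
    # even though only the current page's rows are being read.
    for _ in START_URLS:
        for row in PAGES[url]:
            yield row


def parse_fixed(url):
    # Mirrors the answer: handle just the page that was passed in.
    for row in PAGES[url]:
        yield row


def crawl(parse):
    # The framework's job: one callback invocation per start URL.
    results = []
    for url in START_URLS:
        results.extend(parse(url))
    return results


buggy_rows = crawl(parse_buggy)
fixed_rows = crawl(parse_fixed)
print(len(buggy_rows), len(fixed_rows))  # 9 3
```

The inner loop in parse_buggy never uses its loop variable, which is the telltale sign: it only repeats the same extraction once per entry in the URL list.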
