Dee

Reputation: 23

Scrape All External Links from Multiple URLs in a Text File with Scrapy

I am new to Scrapy and Python. I want Scrapy to read a text file containing a seed list of around 100k URLs, visit each URL, extract all external URLs (URLs of other sites) found on each of those seed URLs, and export the results to a separate text file.

Scrapy should only visit the URLs in the text file, not spider out and follow any other URL.

I want Scrapy to work as fast as possible; I have a very powerful server with a 1 Gbps line. Each URL in my list is from a unique domain, so I won't be hitting any one site hard and thus won't be encountering IP blocks.

How would I go about creating a project in Scrapy to extract all external links from a list of URLs stored in a text file?

Thanks.

Upvotes: 1

Views: 2411

Answers (1)

T Vlad

Reputation: 63

You should use:
1. The start_requests method for reading the list of URLs.
2. A CSS or XPath selector for all "a" HTML elements.

from scrapy import Spider, Request, Item, Field


class YourItem(Item):
    parent_url = Field()
    child_urls = Field()


class YourSpider(Spider):
    name = "your_spider"

    def start_requests(self):
        # Read the seed list and schedule a request for each URL
        with open('your_input.txt', 'r') as f:
            for url in f.readlines():
                yield Request(url.strip(), callback=self.parse)

    def parse(self, response):
        # Collect every href found on the page into one item per seed URL
        item = YourItem(parent_url=response.url)
        item['child_urls'] = response.css('a::attr(href)').extract()
        return item
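
Note that the selector above returns every href on the page, including relative and same-site links, while the question asks for external links only. A minimal filtering sketch, assuming Python 3's urllib.parse (the helper name external_only and the comparison by netloc are illustrative, not part of Scrapy):

from urllib.parse import urlparse

def external_only(parent_url, hrefs):
    # Keep only absolute links whose host differs from the seed URL's host
    parent_host = urlparse(parent_url).netloc
    return [href for href in hrefs
            if urlparse(href).netloc and urlparse(href).netloc != parent_host]

You could apply it in parse, e.g. item['child_urls'] = external_only(response.url, response.css('a::attr(href)').extract()).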

More info about start_requests here:
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests

To export the scraped items to another file, use an Item Pipeline or a Feed Export. A basic pipeline example is here (a short sketch follows the link):
http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-a-json-file
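
A minimal pipeline sketch along the lines of that example, assuming the item fields shown above; the class name TextExportPipeline and the file name external_links.txt are placeholders:

class TextExportPipeline:
    def open_spider(self, spider):
        self.file = open('external_links.txt', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write one child URL per line to a plain text file
        for url in item['child_urls']:
            self.file.write(url + '\n')
        return item

Enable it in settings.py via the ITEM_PIPELINES setting, or skip the pipeline entirely and use a feed export from the command line, e.g. scrapy crawl your_spider -o output.json.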

Upvotes: 1
