Reputation: 3
What I want to do:
I can pause/resume my Scrapy script thanks to this:
http://doc.scrapy.org/en/latest/topics/jobs.html
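(For reference, the jobs page linked above enables pause/resume by passing a JOBDIR setting on the command line, e.g.

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

and re-running the same command later to resume; the spider name and directory here are just placeholders.)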
I've set the script to split the output into a new file every 1,000,000 records.
The Python dictionary only checks URL keys for duplicates within each text file, so at the very least the URLs within a single file will be unique. Keeping one bigger dictionary would, in my opinion, slow the process down tremendously. Having one duplicate every 1,000,000 logged URLs is better than having thousands.
This is the Python code I'm currently using:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

f = open("items0" + ".txt", "w")
num = open("number.txt", "w")

class someSpider(CrawlSpider):
    name = "My script"

    # Ask for the domain and starting URL when the spider is loaded
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]

    i = 0      # URLs written so far
    j = 0      # index of the current output file
    dic = {}   # URLs already seen

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        global f
        global num
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            if item['url'] in self.dic:
                continue
            f.write(item['url'] + "\n")
            self.dic[item['url']] = True
            self.i += 1
            # Roll over to a new output file every 1,000,000 URLs
            if self.i % 1000000 == 0:
                self.j += 1
                f.close()
                f = open("items" + str(self.j) + ".txt", "w")
                num.write(str(self.j + 1) + "\n")
Does anybody have a better method of scraping?
How many log files do you estimate my Scrapy script will end up producing for a website like this?
Upvotes: 0
Views: 453
Reputation: 5272
Scrapy already drops duplicate requests through the DUPEFILTER_CLASS setting. The default, RFPDupeFilter, works much like your dictionary approach, but it does not save the seen URLs across many files.
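For completeness, here is a minimal settings.py sketch of where that filter is configured (the module path assumes Scrapy 1.x; older releases exposed it as scrapy.dupefilter.RFPDupeFilter, and the JOBDIR value is just an example path):

# settings.py -- sketch, assuming Scrapy 1.x
# RFPDupeFilter is already the default, so this line is only needed if you
# want to swap in a custom dupefilter class.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# When a JOBDIR is set (the same setting used for pause/resume), the
# dupefilter persists its seen-request fingerprints to <JOBDIR>/requests.seen,
# so deduplication survives a pause/resume cycle.
JOBDIR = 'crawls/example-1'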
I have also created a small proof of concept (POC):
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class ExampleSpider(CrawlSpider):
    name = "ExampleSpider"
    allowed_domains = ["www.example.com", "www.iana.org"]
    start_urls = (
        'http://www.example.com/',
    )
    rules = (Rule(LxmlLinkExtractor(allow_domains=allowed_domains), callback='parse_obj', follow=True),)

    log_file = open('test.log', 'a')

    def parse_obj(self, response):
        # self.logger.info(response.url)
        self.logger.info(self.settings['DUPEFILTER_CLASS'])
        self.log_file.write(response.url + '\n')
Run it with scrapy crawl ExampleSpider -s DUPEFILTER_DEBUG=1, and you should see debug output like the following:
[scrapy] DEBUG: Filtered duplicate request: <GET http://www.iana.org/about/framework>
Upvotes: 1