Reputation: 3
What I want to do:
I can pause/resume my Scrapy script thanks to this:
http://doc.scrapy.org/en/latest/topics/jobs.html
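(For reference, the jobs page linked above enables pause/resume by passing a JOBDIR setting on the command line, e.g.

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

and re-running the same command later to resume; the spider name and directory here are just placeholders.)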
I've set the script to split the output into a new file every 1,000,000 records.
The Python dictionary only checks URL keys for duplicates within each text file, so at the very least the URLs within a single file will be unique. Keeping one bigger dictionary would, in my opinion, slow the process down tremendously. Having one duplicate every 1,000,000 logged URLs is better than having thousands.
This is the Python code I'm currently using:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

f = open("items0" + ".txt", "w")
num = open("number.txt", "w")

class someSpider(CrawlSpider):
    name = "My script"

    # Ask for the domain and starting URL when the spider is loaded
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]

    i = 0      # URLs written so far
    j = 0      # index of the current output file
    dic = {}   # URLs already seen

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        global f
        global num
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            if item['url'] in self.dic:
                continue
            f.write(item['url'] + "\n")
            self.dic[item['url']] = True
            self.i += 1
            # Roll over to a new output file every 1,000,000 URLs
            if self.i % 1000000 == 0:
                self.j += 1
                f.close()
                f = open("items" + str(self.j) + ".txt", "w")
                num.write(str(self.j + 1) + "\n")
Does anybody have a better method of scraping?
How many log files do you estimate my Scrapy script will end up producing for a website like this?
Upvotes: 0
Views: 453
Reputation: 5272
Scrapy already drops duplicate requests through the DUPEFILTER_CLASS setting. The default, RFPDupeFilter, works much like your dictionary approach, but it does not save the seen URLs across many files.
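For completeness, here is a minimal settings.py sketch of where that filter is configured (the module path assumes Scrapy 1.x; older releases exposed it as scrapy.dupefilter.RFPDupeFilter, and the JOBDIR value is just an example path):

# settings.py -- sketch, assuming Scrapy 1.x
# RFPDupeFilter is already the default, so this line is only needed if you
# want to swap in a custom dupefilter class.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'

# When a JOBDIR is set (the same setting used for pause/resume), the
# dupefilter persists its seen-request fingerprints to <JOBDIR>/requests.seen,
# so deduplication survives a pause/resume cycle.
JOBDIR = 'crawls/example-1'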
I have also created a small proof of concept (POC):
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor


class ExampleSpider(CrawlSpider):
    name = "ExampleSpider"
    allowed_domains = ["www.example.com", "www.iana.org"]
    start_urls = (
        'http://www.example.com/',
    )
    rules = (Rule(LxmlLinkExtractor(allow_domains=allowed_domains), callback='parse_obj', follow=True),)

    log_file = open('test.log', 'a')

    def parse_obj(self, response):
        # self.logger.info(response.url)
        self.logger.info(self.settings['DUPEFILTER_CLASS'])
        self.log_file.write(response.url + '\n')
Run it with scrapy crawl ExampleSpider -s DUPEFILTER_DEBUG=1, and you should see debug output like the following:
[scrapy] DEBUG: Filtered duplicate request: <GET http://www.iana.org/about/framework>
Upvotes: 1