user5616520

Reputation: 91

Scrapy - access data while crawling and randomly change user agent

Is it possible to access the data while scrapy is crawling? I have a script that finds a specific keyword and writes both the keyword and the link where it was found to a .csv file. However, I have to wait for scrapy to finish crawling, and only then does it actually output the data to the .csv file.

I'm also trying to change my user agent randomly, but it's not working. If two questions in one aren't allowed, I'll post this as a separate question.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.spiders import Spider
from scrapy import log
from FinalSpider.items import Page
from FinalSpider.settings import USER_AGENT_LIST

import random


class FinalSpider(Spider):
    name = "FinalSpider"
    allowed_domains = ['url.com']
    # note: xrange(62L, 62L) is empty as written, so no start URLs are generated
    start_urls = ['url.com=%d' % (n)
                  for n in xrange(62L, 62L)]

    def parse(self, response):
        item = Page()
        item['URL'] = response.url
        item['Stake'] = ''.join(response.xpath('//div[@class="class"]//span[@class="class" or @class="class"]/text()').extract())
        # the Page item has no 'cur' field; compare the extracted 'Stake' value instead
        if item['Stake'] in [u'50,00', u'100,00']:
            return item


# downloader middleware: change the user agent on roughly 30% of requests
class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        if random.randint(1, 100) <= 30:
            ua = random.choice(USER_AGENT_LIST)
            if ua:
                # assign directly: setdefault() would not override the
                # User-Agent header scrapy has already set on the request
                request.headers['User-Agent'] = ua
                log.msg('UserAgent changed to %s' % ua)

Upvotes: 0

Views: 543

Answers (1)

eLRuLL

Reputation: 18799

You are not obliged to output your collected items (aka "data") into a csv file; you can simply run scrapy with:

scrapy crawl myspider

This will output the logs to the terminal. To store just the items into a csv file, I assume you are doing something like this:

scrapy crawl myspider -o items.csv

Now, if you want to store both the logs and the items, I suggest you put this into your settings.py file:

LOG_FILE = "logfile.log"

Now you can watch what the spider is doing while it runs just by checking that file.
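
If what you actually want is the csv rows to appear while the crawl is still running, rather than when it finishes, a small item pipeline can write and flush each item as soon as it is scraped. A minimal sketch, assuming the Page item from your spider (with its URL and Stake fields) and a hypothetical output file items_live.csv:

# FinalSpider/pipelines.py -- writes each item to csv as it is scraped
import csv

class CsvWriterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items_live.csv', 'wb')  # binary mode for the py2 csv module
        self.writer = csv.writer(self.file)
        self.writer.writerow(['URL', 'Stake'])

    def process_item(self, item, spider):
        self.writer.writerow([item['URL'], item['Stake'].encode('utf-8')])
        self.file.flush()  # flush so the row is visible while the spider is still crawling
        return item

    def close_spider(self, spider):
        self.file.close()

and enable it in settings.py:

ITEM_PIPELINES = {'FinalSpider.pipelines.CsvWriterPipeline': 300}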

For your problem with the random user agent, please check how to activate scrapy downloader middlewares.
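
Activating a downloader middleware means registering it in settings.py under DOWNLOADER_MIDDLEWARES; importing that dict inside the spider file does nothing. A sketch, assuming you move the middleware class into a FinalSpider/middlewares.py module:

DOWNLOADER_MIDDLEWARES = {
    'FinalSpider.middlewares.RandomUserAgentMiddleware': 400,
    # disable the built-in user agent middleware so it does not overwrite yours
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

USER_AGENT_LIST = [
    # fill in real user agent strings here
    'Mozilla/5.0 (Windows NT 6.1; WOW64)',
]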

Upvotes: 1
