Om Prakash

Reputation: 2881

How to add random user agent to scrapy spider when calling spider from script?

I want to add random user agent to every request for a spider being called by other script. My implementation is as follows:

CoreSpider.py

import glob
import os
import re

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from boilerpipe.extract import Extractor  # Extractor with 'LargestContentExtractor' comes from python-boilerpipe

import ContentHandler_copy

class CoreSpider(scrapy.Spider):
    name = "final"

    def __init__(self):
        super().__init__()
        self.start_urls = self.read_url()
        self.rules = (
            Rule(
                LinkExtractor(
                    unique=True,
                ),
                callback='parse',
                follow=True
            ),
        )

    def read_url(self):
        urlList = []
        for filename in glob.glob(os.path.join("/root/Public/company_profiler/seed_list", '*.list')):
            with open(filename, "r") as f:
                for line in f.readlines():
                    url = re.sub('\n', '', line)
                    if "http" not in url:
                        url = "http://" + url
                    # print(url)
                    urlList.append(url)
        return urlList

    def parse(self, response):
        print("URL is: ", response.url)
        print("User agent is : ", response.request.headers['User-Agent'])
        filename = '/root/Public/company_profiler/crawled_page/%s.html' % response.url
        article = Extractor(extractor='LargestContentExtractor', html=response.body).getText()
        print("Article is :", article)
        if len(article.split("\n")) < 5:
            print("Skipping to next url : ", article.split("\n"))
        else:
            print("Continue parsing: ", article.split("\n"))
            ContentHandler_copy.ContentHandler_copy.start(article, response.url)

I am running this spider from a script, RunSpider.py, as follows:

from scrapy.crawler import CrawlerProcess

from CoreSpider import CoreSpider

process = CrawlerProcess()
process.crawl(CoreSpider)  # pass the spider class, not an instance
process.start()

It works fine. Now I want to use a different, randomly chosen user agent for each request. I have successfully used random user agents in a regular Scrapy project, but I am unable to integrate that setup with this spider when it is called from another script.

My settings.py from the working Scrapy project:

BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'tutorial (+http://www.yourdomain.com)'

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 320,
}

USER_AGENT_LIST = "tutorial/user-agent.txt"

How can I tell my CoreSpider.py to use this settings.py configuration programmatically?

Upvotes: 1

Views: 2918

Answers (1)

Tomáš Linhart

Reputation: 10210

Take a look at the documentation, specifically the Common Practices section. You can supply settings as an argument to the CrawlerProcess constructor. Or, if you are using a Scrapy project and want to take the settings from settings.py, you can do it like this:

...
from scrapy.utils.project import get_project_settings    
process = CrawlerProcess(get_project_settings())
...

Upvotes: 3
