Tim

Reputation: 201

Scrape different URLs with different user agents and IP addresses

I have a program that needs to scrape several different URLs using Scrapy, and I need it to use the same user agent and IP address for each URL. So if I am scraping, say, 50 URLs, each URL should get one unique user agent and IP address that are only ever used when scraping that URL, and the IP address and user agent should change when the program moves on to the next URL.

I have already got it to rotate user agents randomly, but now I need to pair user agents with specific URLs and use the same user agent for the same URL every time. As for the IP addresses, I cannot even get them to rotate randomly, let alone pair each one with a unique URL.
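Conceptually, the pairing I am after looks like this (just a sketch to show the idea; the URLs, proxies, and user agents below are placeholders, not my real values):

# each URL pinned to one (proxy, user agent) pair, reused on every run
start_urls = ['http://example.com/a', 'http://example.com/b']
proxies = ['http://199.89.192.76:8050', 'http://199.89.192.77:8050']
user_agents = ['Mozilla/5.0 (Windows NT 10.0; ...)', 'Mozilla/5.0 (X11; ...)']

profiles = dict(zip(start_urls, zip(proxies, user_agents)))
proxy, user_agent = profiles['http://example.com/a']  # always the same pair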

SplashSpider.py

from scrapy.spiders import Spider
from scrapy_splash import SplashRequest
from ..items import GameItem


class MySpider(Spider):
    name = 'splash_spider'  # name of the spider
    start_urls = ['']  # url(s)
    # ......
    # all the urls I need to scrape, 50+ will go in these lines

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 3})

    # Scraping
    def parse(self, response):
        # loop through the contents of the webpage until all needed info is scraped
        for game in response.css(""):
            item = GameItem()  # a fresh item per row, so earlier yields aren't overwritten
            # Card name (selector corresponding to the card name)
            item["card name"] = game.css("").extract_first()
            # Price
            item["Price"] = game.css("td.deckdbbody.search_results_9::text").extract_first()
            yield item

settings.py


SPIDER_MODULES = ['scrapy_javascript.spiders']
NEWSPIDER_MODULE = 'scrapy_javascript.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_javascript (http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# settings.py is plain Python: a second DOWNLOADER_MIDDLEWARES assignment would
# silently override the first, so all middlewares go in a single dict.
DOWNLOADER_MIDDLEWARES = {
    # Splash
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # user agent rotation
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    # proxy rotation
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# -----------------------------------------------------------------------------
# USER AGENT
# -----------------------------------------------------------------------------

USER_AGENTS = [
    # Chrome
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 5.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36',
    # Internet Explorer
    'Mozilla/4.0 (compatible; MSIE 9.0; Windows NT 6.1)',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; Trident/7.0; rv:11.0) like Gecko',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)'
]


# -----------------------------------------------------------------------------
# IP ADDRESSES
# -----------------------------------------------------------------------------
PROXY_POOL_ENABLED = True
# (the rotating_proxies middlewares are registered in the single
#  DOWNLOADER_MIDDLEWARES dict above)

# note: each entry needs a trailing comma; without them Python concatenates
# the adjacent strings into one giant proxy URL
ROTATING_PROXY_LIST = [
    'http://199.89.192.76:8050',
    'http://199.89.192.77:8050',
    'http://199.89.192.78:8050',
    'http://199.89.193.2:8050',
    'http://199.89.193.3:8050',
    'http://199.89.193.4:8050',
    'http://199.89.193.5:8050',
    'http://199.89.193.6:8050',
    'http://199.89.193.7:8050',
    'http://199.89.193.8:8050',
    'http://199.89.193.9:8050',
    'http://199.89.193.10:8050',
    'http://199.89.193.11:8050',
    'http://199.89.193.12:8050',
    'http://199.89.193.13:8050',
    'http://199.89.193.14:8050',
    'http://204.152.114.226:8050',
    'http://204.152.114.227:8050',
    'http://204.152.114.228:8050',
    'http://204.152.114.229:8050',
    'http://204.152.114.230:8050',
    'http://204.152.114.232:8050',
    'http://204.152.114.233:8050',
    'http://204.152.114.234:8050',
    'http://204.152.114.235:8050',
    'http://204.152.114.236:8050',
    'http://204.152.114.237:8050',
    'http://204.152.114.238:8050',
    'http://204.152.114.239:8050',
    'http://204.152.114.240:8050',
    'http://204.152.114.241:8050',
    'http://204.152.114.242:8050',
    'http://204.152.114.243:8050',
    'http://204.152.114.244:8050',
    'http://204.152.114.245:8050',
    'http://204.152.114.246:8050',
    'http://204.152.114.247:8050',
    'http://204.152.114.248:8050',
    'http://204.152.114.249:8050',
    'http://204.152.114.250:8050',
    'http://204.152.114.251:8050',
    'http://204.152.114.252:8050',
    'http://204.152.114.253:8050',
    'http://204.152.114.254:8050',
]
SPLASH_URL = 'http://199.89.192.74:8050'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

In the end, it should simply pair each of the URLs I need to scrape with an IP address and a user agent from the lists in my settings.py file.

Upvotes: 1

Views: 875

Answers (1)

Granitosaurus

Reputation: 21436

This is a bit out of scope for a simple Stack Overflow question.

However, the general approach to customizing the requests sent out by a Scrapy crawler is to write a downloader middleware[1].

In your case you want to write a downloader middleware that would:

1. Generate profiles on spider start by making a list of `(ip, user-agent)` tuples
2. Make a round-robin (or alternative) queue of these profiles
3. Assign one profile from the queue to every outgoing request

Briefly, in code it would look something like this:

# middlewares.py
import random
from copy import copy

from scrapy import signals


class ProfileMiddleware:

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        mw = cls()
        # build the profile list once the spider opens
        crawler.signals.connect(mw.spider_opened, signal=signals.spider_opened)
        mw.settings = crawler.settings
        return mw

    def spider_opened(self, spider, **kwargs):
        proxies = self.settings.getlist('PROXIES')
        user_agents = self.settings.getlist('USER_AGENTS')
        # pair every proxy with a user agent
        self.profiles = list(zip(proxies, user_agents))
        self.queue = copy(self.profiles)
        random.shuffle(self.queue)

    def process_request(self, request, spider):
        # refill and reshuffle the queue once every profile has been used
        if not self.queue:
            self.queue = copy(self.profiles)
            random.shuffle(self.queue)

        profile = self.queue.pop()
        request.headers['User-Agent'] = profile[1]
        request.meta['proxy'] = profile[0]

I haven't tested this; it's just to illustrate the general idea.
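One scrapy-splash caveat on top of that: `request.meta['proxy']` proxies the request from Scrapy to the Splash server, not the request Splash itself makes to the target site. Splash's render endpoints accept `proxy` and `headers` arguments, so a variant of `process_request` could push the profile into the Splash args instead (untested sketch, assuming every request is a `SplashRequest`):

    def process_request(self, request, spider):
        if not self.queue:
            self.queue = copy(self.profiles)
            random.shuffle(self.queue)
        profile = self.queue.pop()
        request.headers['User-Agent'] = profile[1]
        if 'splash' in request.meta:
            # scrapy-splash keeps its arguments in request.meta['splash']['args'];
            # Splash applies 'proxy' and 'headers' to its own outgoing request
            request.meta['splash']['args']['proxy'] = profile[0]
            request.meta['splash']['args']['headers'] = {'User-Agent': profile[1]}
        else:
            request.meta['proxy'] = profile[0]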

Then activate it, somewhere at the end of the middleware chain:

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProfileMiddleware': 900, 
}
PROXIES = ['123', '456'...]
USER_AGENTS = ['firefox', 'chrome'...]
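The queue above hands each request a fresh profile, but it doesn't guarantee a given URL always gets the same one. If you need that pinning (as the question asks), one option is to derive the profile from the URL itself, for example by hashing it. A rough sketch, assuming the same `self.profiles` list as above:

# at the top of middlewares.py
import zlib

    # drop-in replacement for ProfileMiddleware.process_request:
    # pin each URL to one profile deterministically instead of using the queue
    def process_request(self, request, spider):
        # crc32 is stable across runs, unlike Python's built-in hash()
        index = zlib.crc32(request.url.encode()) % len(self.profiles)
        proxy, user_agent = self.profiles[index]
        request.headers['User-Agent'] = user_agent
        request.meta['proxy'] = proxy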

[1] More on Scrapy's downloader middlewares: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html

Upvotes: 3
