saprative

Reputation: 421

How to configure rotating proxy with scrapy playwright?

I am trying to add a rotating proxy to Scrapy Playwright. scrapy-proxy-pool does not work well with Scrapy Playwright, so I dug into https://github.com/rejoiceinhope/scrapy-proxy-pool and found that it uses https://pypi.org/project/proxyscrape/ to build its rotating proxy mechanism.

I have been trying to debug this for hours, and I think I am making a technical mistake somewhere: the request first shows a connection error with the proxy server and then a timeout error.

My Code:

import scrapy
from scrapy_playwright.page import PageMethod
from proxyscrape import create_collector

collector = create_collector('proxy', 'http')

class ProxySpider(scrapy.Spider):
    name = 'proxy'
    
    PLAYWRIGHT_LAUNCH_OPTIONS = {
        "headless": False,
        "timeout": 100 * 1000,  # 100 seconds (value is in milliseconds)
    
    }
    
    def start_requests(self):
        proxy = collector.get_proxy()
        print("Proxy --> http://"+proxy.host+":"+proxy.port)
      
        yield scrapy.Request("http://httpbin.org/get", meta={
            "playwright": True,
            "playwright_context_kwargs": {
                "java_script_enabled": True,
                "ignore_https_errors": True,
                "proxy": {
                    "server": "http://"+proxy.host+":"+proxy.port,
                },
            },
            })       
    
    def parse(self,response):
        print(response.text)

Error:

  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 297, in _download_request
    result = await self._download_request_with_page(request, page, spider)
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 331, in _download_request_with_page
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 9162, in goto
    await self._impl_obj.goto(
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_page.py", line 494, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 147, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 44, in send
    return await self._connection.wrap_api_call(
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 419, in wrap_api_call
    return await cb()
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 79, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: net::ERR_TIMED_OUT at http://httpbin.org/get
=========================== logs ===========================
navigating to "http://httpbin.org/get", waiting until "load"
============================================================

Upvotes: 2

Views: 1891

Answers (1)

LLaP

Reputation: 2727

I found this snippet for using a proxy with scrapy-playwright. It sets a single proxy in the browser launch options rather than per request. Maybe it will help you.

from scrapy import Spider, Request

class ProxySpider(Spider):
    name = "proxy"
    custom_settings = {
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://myproxy.com:3128",
                "username": "user",
                "password": "pass",
            },
        }
    }

    def start_requests(self):
        yield Request("http://httpbin.org/get", meta={"playwright": True})

    def parse(self, response):
        print(response.text)
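
The snippet above uses one fixed proxy for the whole browser. If you need a different proxy per request, one possible approach is to give each proxy its own named browser context via the `playwright_context` and `playwright_context_kwargs` meta keys. The following is only a minimal sketch of that idea: the proxy URLs, the `playwright_meta` helper, and the `proxy-...` context naming are all made up for illustration, and I have not tested it against live proxies. Also note that, depending on the Playwright version and platform, Chromium may need a placeholder proxy set at launch before per-context proxies take effect.

```python
from itertools import cycle

# Hypothetical proxy list; in practice this could come from proxyscrape
# or any other source of "scheme://host:port" strings.
PROXIES = [
    "http://proxy1.example.com:3128",
    "http://proxy2.example.com:3128",
]


def playwright_meta(proxy_url):
    """Build Request.meta asking scrapy-playwright for a context tied to proxy_url.

    The context name is derived from the proxy URL so that each proxy gets
    its own browser context; reusing a name reuses the existing context.
    """
    return {
        "playwright": True,
        "playwright_context": f"proxy-{proxy_url}",
        "playwright_context_kwargs": {
            "proxy": {"server": proxy_url},
        },
    }


# Round-robin over the proxies: the third request wraps back to the first proxy.
proxy_pool = cycle(PROXIES)
metas = [playwright_meta(next(proxy_pool)) for _ in range(3)]
```

In `start_requests` you would then do something like `yield scrapy.Request(url, meta=playwright_meta(next(proxy_pool)))`. Using a separate context name per proxy matters because the context kwargs are applied when a context is created, so sending every request to the same context name would keep whichever proxy that context was first created with.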

Upvotes: 0
