Reputation: 421
I am trying to add rotating proxies to Scrapy Playwright. scrapy-proxy-pool does not work well with Scrapy Playwright, so I dug into https://github.com/rejoiceinhope/scrapy-proxy-pool and found that it uses https://pypi.org/project/proxyscrape/ to build its rotating proxy mechanism.
I have been trying to debug this for hours, and I think I am making some technical mistake: first it shows a connection error with the proxy server, and then it shows a timeout error.
My Code:
import scrapy
from scrapy_playwright.page import PageMethod
from proxyscrape import create_collector

collector = create_collector('proxy', 'http')


class ProxySpider(scrapy.Spider):
    name = 'proxy'
    PLAYWRIGHT_LAUNCH_OPTIONS = {
        "headless": False,
        "timeout": 100 * 1000,  # 100 seconds (value is in milliseconds)
    }

    def start_requests(self):
        proxy = collector.get_proxy()
        print("Proxy --> http://" + proxy.host + ":" + proxy.port)
        yield scrapy.Request("http://httpbin.org/get", meta={
            "playwright": True,
            "playwright_context_kwargs": {
                "java_script_enabled": True,
                "ignore_https_errors": True,
                "proxy": {
                    "server": "http://" + proxy.host + ":" + proxy.port,
                },
            },
        })

    def parse(self, response):
        print(response.text)
Error:
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 297, in _download_request
    result = await self._download_request_with_page(request, page, spider)
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/scrapy_playwright/handler.py", line 331, in _download_request_with_page
    response = await page.goto(url=request.url, **page_goto_kwargs)
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/async_api/_generated.py", line 9162, in goto
    await self._impl_obj.goto(
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_page.py", line 494, in goto
    return await self._main_frame.goto(**locals_to_params(locals()))
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_frame.py", line 147, in goto
    await self._channel.send("goto", locals_to_params(locals()))
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 44, in send
    return await self._connection.wrap_api_call(
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 419, in wrap_api_call
    return await cb()
  File "/home/sappy/.virtualenvs/121-server/lib/python3.10/site-packages/playwright/_impl/_connection.py", line 79, in inner_send
    result = next(iter(done)).result()
playwright._impl._api_types.Error: net::ERR_TIMED_OUT at http://httpbin.org/get
=========================== logs ===========================
navigating to "http://httpbin.org/get", waiting until "load"
============================================================
Upvotes: 2
Views: 1891
Reputation: 2727
I found this snippet for using proxies with Playwright. Maybe it will help you.
from scrapy import Spider, Request


class ProxySpider(Spider):
    name = "proxy"
    custom_settings = {
        # Launch options must live in the settings to be picked up
        "PLAYWRIGHT_LAUNCH_OPTIONS": {
            "proxy": {
                "server": "http://myproxy.com:3128",
                "username": "user",
                "password": "pass",
            },
        }
    }

    def start_requests(self):
        yield Request("http://httpbin.org/get", meta={"playwright": True})

    def parse(self, response):
        print(response.text)
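The snippet above sets one fixed proxy for the whole browser. If you want a different proxy per request, one possible approach is to give each proxy its own named context, since scrapy-playwright only applies playwright_context_kwargs when the context named in playwright_context does not exist yet. Below is a rough sketch of that idea, not a tested solution: build_proxy_meta, next_proxy_meta and the PROXIES list are hypothetical names of mine, and the addresses are placeholders you would replace with the proxies returned by your collector.

```python
from itertools import cycle


def build_proxy_meta(host, port):
    """Build Request.meta asking scrapy-playwright for a per-proxy context.

    Hypothetical helper: only the meta keys themselves come from scrapy-playwright.
    """
    server = f"http://{host}:{port}"  # works whether port is a str or an int
    return {
        "playwright": True,
        # One named context per proxy, so a new proxy means a new context.
        "playwright_context": f"proxy-{host}-{port}",
        # Used only when the named context has not been created yet.
        "playwright_context_kwargs": {
            "ignore_https_errors": True,
            "proxy": {"server": server},
        },
    }


# Placeholder addresses (assumption); swap in real, working proxies.
PROXIES = [("203.0.113.10", 3128), ("203.0.113.11", "8080")]
proxy_cycle = cycle(PROXIES)


def next_proxy_meta():
    """Return meta for the next proxy in round-robin order."""
    host, port = next(proxy_cycle)
    return build_proxy_meta(host, port)
```

In start_requests you would then yield Request(url, meta=next_proxy_meta()) for each URL. Keep in mind that free proxies are often dead or very slow, which on its own can produce net::ERR_TIMED_OUT, so it may be worth checking the proxy outside Scrapy first.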
Upvotes: 0