ML_Enthousiast

Reputation: 1269

Configure random proxies with R for scraping

I am scraping a website that authorizes scraping in its robots.txt rules, but sometimes I get blocked.

While I have contacted the admin to understand why, I also want to understand how to use different proxies within R so that I can keep scraping without being blocked.

I followed this quick tutorial: https://support.rstudio.com/hc/en-us/articles/200488488-Configuring-R-to-Use-an-HTTP-or-HTTPS-Proxy

So I edited the environment file:

file.edit('~/.Renviron')

and within it I inserted a list of proxies to be selected at random:

proxies_list <- c("128.199.109.241:8080", "113.53.230.195:3128", "125.141.200.53:80",
                  "125.141.200.14:80", "128.199.200.112:138", "149.56.123.99:3128",
                  "128.199.200.112:80", "125.141.200.39:80", "134.213.29.202:4444")
proxy <- paste0('https://', sample(proxies_list, 1))
https_proxy=proxy

But when I scrape with this code:

library(rvest)  # read_html() is provided by rvest (via xml2)

download.file(url_proxy, destfile = 'output.html', quiet = TRUE)
html_output <- read_html('output.html')

I keep being blocked.

Am I not setting the proxies correctly?

Thanks! M.

Upvotes: 1

Views: 1288

Answers (1)

Neal Fultz

Reputation: 9687

You need to set environment variables, not R variables. A ~/.Renviron file is read as plain NAME=value pairs; it is not evaluated as R code, so the sample() call in yours never runs. See ?download.file for more details.

e.g.

Sys.setenv(http_proxy=proxy)

before anything else happens. Also note the warning in the docs:

These environment variables must be set before the download code is first used: they cannot be altered later by calling 'Sys.setenv'.
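Putting the two pieces together, here is a minimal sketch of the question's setup with this fix applied, assuming url_proxy holds the target page URL. The proxy is chosen and exported at the top of a fresh R session, before the first download:

library(rvest)  # for read_html()

# Pool of proxies to rotate through (taken from the question)
proxies_list <- c("128.199.109.241:8080", "113.53.230.195:3128",
                  "125.141.200.53:80", "149.56.123.99:3128")

# Pick one at random and export it as an environment variable.
# Per the warning above, this must run before download.file() is
# first used in the session; it cannot be changed afterwards.
proxy_pick <- sample(proxies_list, 1)

# The scheme here is the proxy's own protocol, not the target's,
# so http:// is the usual choice even for https_proxy.
Sys.setenv(http_proxy  = paste0("http://", proxy_pick),
           https_proxy = paste0("http://", proxy_pick))

download.file(url_proxy, destfile = "output.html", quiet = TRUE)
html_output <- read_html("output.html")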
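Since those variables cannot be rotated mid-session, if the real goal is a different proxy on every request, one alternative worth knowing about (a sketch, separate from the download.file approach above) is httr, which accepts a proxy per call via use_proxy():

library(httr)

proxies_list <- c("128.199.109.241:8080", "113.53.230.195:3128")

# Split a random "host:port" entry and pass it to this one request;
# unlike the environment variables, this can differ on every call.
p    <- strsplit(sample(proxies_list, 1), ":", fixed = TRUE)[[1]]
resp <- GET(url_proxy, use_proxy(p[1], as.integer(p[2])))
writeBin(content(resp, as = "raw"), "output.html")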

Upvotes: 1
